Merged
Commits
1691 commits
aec9674
[Core] Remove legacy input mapper/processor from V0 (#15686)
DarkLight1337 Apr 28, 2025
fa93cd9
[Model] Add Granite Speech Support (#16246)
alex-jw-brooks Apr 28, 2025
72c5b97
Update tpu_worker.py 's typo (#17288)
idouba Apr 28, 2025
fb1c933
Add missing class docstring for `PromptAdapterConfig` (#17302)
hmellor Apr 28, 2025
344e193
[Bugfix] Add missing `get_language_model` to new MLLMs (#17300)
DarkLight1337 Apr 28, 2025
3ad986c
[doc] update wrong model id (#17287)
reidliu41 Apr 28, 2025
889ebb2
[Misc] Minor typo/grammar in `platforms/interface.py` (#17307)
NickLucche Apr 28, 2025
8b464d9
[Misc] Clean up Qwen2.5-Omni code (#17301)
DarkLight1337 Apr 28, 2025
72dfe4c
[Docs] Add a security guide (#17230)
russellb Apr 28, 2025
f948869
Improve conversion from dataclass configs to argparse arguments (#17303)
hmellor Apr 28, 2025
b6dd32a
Make name of `compressed-tensors` quant method consistent across vLLM…
hmellor Apr 28, 2025
c7941cc
Explicitly explain quant method override ordering and ensure all over…
hmellor Apr 28, 2025
a0304dc
[Security] Don't bind tcp zmq socket to all interfaces (#17197)
russellb Apr 28, 2025
2c89cd9
[Chore] cleanup license indicators in light of SPDX (#17259)
aarnphm Apr 28, 2025
cc5befb
[BugFix] Fix cascade attention - RuntimeError: scheduler_metadata mus…
LucasWilkinson Apr 28, 2025
ed24620
[Bugfix] Fix moe weight losing all extra attrs after `process_weights…
charlifu Apr 28, 2025
dcbac4c
[Model] Qwen3 Dense FP8 Compat Fixes (#17318)
simon-mo Apr 28, 2025
6e74fd4
Support loading transformers models with named parameters (#16868)
Apr 28, 2025
8fc88d6
[Model] Add tuned triton fused_moe configs for Qwen3Moe (#17328)
mgoin Apr 28, 2025
cfe4532
[Benchmark] Add single turn MTBench to Serving Bench (#17202)
ekagra-ranjan Apr 28, 2025
506475d
[Optim] Compute multimodal hash only once per item (#17314)
DarkLight1337 Apr 29, 2025
86d9fc2
implement Structural Tag with Guidance backend (#17333)
mmoskal Apr 29, 2025
e136000
[V1][Spec Decode] Make Eagle model arch config driven (#17323)
ekagra-ranjan Apr 29, 2025
b4ac4fa
[model] make llama4 compatible with pure dense layers (#17315)
luccafong Apr 29, 2025
d6da8a8
[Bugfix] Fix `numel()` downcast in fused_layernorm_dynamic_per_token_…
r-barnes Apr 29, 2025
165cb56
Ignore `'<string>'` filepath (#17330)
zou3519 Apr 29, 2025
17eb306
[Bugfix] Add contiguous call inside rope kernel wrapper (#17091)
timzsu Apr 29, 2025
96e06e3
[Misc] Add a Jinja template to support Mistral3 function calling (#17…
chaunceyjiang Apr 29, 2025
cde384c
[Model] support MiniMax-VL-01 model (#16328)
qscqesze Apr 29, 2025
ebb3930
[Misc] Move config fields to MultiModalConfig (#17343)
DarkLight1337 Apr 29, 2025
bdb2cdd
[Misc]Use a platform independent interface to obtain the device attri…
jiangpeng36 Apr 29, 2025
193e78e
[Fix] Documentation spacing in compilation config help text (#17342)
Zerohertz Apr 29, 2025
4464109
[Build][Bugfix] Restrict setuptools version to <80 (#17320)
gshtras Apr 29, 2025
97cc872
[Model] Ignore rotary embed load for Cohere model (#17319)
ekagra-ranjan Apr 29, 2025
4a5e131
Update docs requirements (#17379)
hmellor Apr 29, 2025
890f104
[Doc] Fix QWen3MOE info (#17381)
jeejeelee Apr 29, 2025
00ee37e
[Bugfix] Clean up MiniMax-VL and fix processing (#17354)
DarkLight1337 Apr 29, 2025
40896bd
`pre-commit autoupdate` (#17380)
hmellor Apr 29, 2025
88ad9ec
[Frontend] Support `chat_template_kwargs` in `LLM.chat` (#17356)
DarkLight1337 Apr 29, 2025
900edfa
Transformers backend tweaks (#17365)
hmellor Apr 29, 2025
0ed27ef
Fix: Spelling of inference (#17387)
a2q1p Apr 29, 2025
2ef5d10
Improve literal dataclass field conversion to argparse argument (#17391)
hmellor Apr 29, 2025
24e6ad3
[V1] Remove num_input_tokens from attn_metadata (#17193)
heheda12345 Apr 29, 2025
a39203f
[Bugfix] add qwen3 reasoning-parser fix content is None when disable …
mofanke Apr 29, 2025
d3cf61b
fix gemma3 results all zero (#17364)
mayuyuace Apr 29, 2025
06ffc7e
[Misc][ROCm] Exclude `cutlass_mla_decode` for ROCm build (#17289)
tywuAMD Apr 29, 2025
608968b
Enabling multi-group kernel tests. (#17115)
Alexei-V-Ivanov-AMD Apr 29, 2025
56d64fb
[Docs] Propose a deprecation policy for the project (#17063)
russellb Apr 29, 2025
0c1c788
[Doc][Typo] Fixing label in new model requests link in overview.md (#…
casinca Apr 29, 2025
792595b
[TPU][V1][CI] Replace `python3 setup.py develop` with standard `pip i…
NickLucche Apr 29, 2025
b37685a
[CI] Uses Python 3.11 for TPU (#17359)
aarnphm Apr 29, 2025
08e15de
[CI/Build] Add retry mechanism for add-apt-repository (#17107)
reidliu41 Apr 29, 2025
2fa2a50
[Bugfix] Fix Minicpm-O-int4 GPTQ model inference (#17397)
Isotr0py Apr 29, 2025
a6977db
Simplify (and fix) passing of guided decoding backend options (#17008)
hmellor Apr 29, 2025
0350809
Remove Falcon3 2x7B from CI (#17404)
hmellor Apr 29, 2025
c9c1b59
Fix: Python package installation for opentelmetry (#17049)
dilipgb Apr 29, 2025
70788bd
[V1][Spec Decode] Apply torch.compile & cudagraph to EAGLE (#17211)
luyuzhe111 Apr 29, 2025
7489ec0
Remove Bamba 9B from CI (#17407)
hmellor Apr 29, 2025
34120f5
[V1][Feature] Enable Speculative Decoding with Structured Outputs (#1…
benchislett Apr 30, 2025
4055130
[release] Always git fetch all to get latest tag on TPU release (#17322)
khluu Apr 30, 2025
1c2bc7e
Truncation control for embedding models (#14776)
gmarinho2 Apr 30, 2025
2c4f59a
Update PyTorch to 2.7.0 (#16859)
huydhn Apr 30, 2025
13698db
Improve configs - `ModelConfig` (#17130)
hmellor Apr 30, 2025
d1f569b
Fix call to `logger.info_once` (#17416)
hmellor Apr 30, 2025
88fcf00
Fix some speculative decode tests with tl.dot (#17371)
huydhn Apr 30, 2025
a44c4f1
Support LoRA for Mistral3 (#17428)
mgoin Apr 30, 2025
6ed9f60
[Intel GPU] [CI]Fix XPU ci, setuptools >=80.0 have build issue (#17298)
jikunshang Apr 30, 2025
ed6cfb9
[Hardware][Intel GPU] Upgrade to torch 2.7 (#17444)
jikunshang Apr 30, 2025
be633fb
[Bugfix] Fix AttributeError: 'State' object has no attribute 'engine_…
chaunceyjiang Apr 30, 2025
54072f3
[MODEL ADDITION] Ovis2 Model Addition (#15826)
mlinmg Apr 30, 2025
ece5a8b
Make the _apply_rotary_emb compatible with dynamo (#17435)
houseroad Apr 30, 2025
1534d38
[Misc] Remove deprecated files (#17447)
chaunceyjiang Apr 30, 2025
d803786
[V1][Bugfix]: vllm v1 verison metric num_gpu_blocks is None (#15755)
lengrongfu Apr 30, 2025
a7d5b01
[TPU][V1][CI] Update regression test baseline for v6 CI (#17064)
NickLucche Apr 30, 2025
77073c7
[Core] Prevent side-channel attacks via cache salting (#17045)
dr75 Apr 30, 2025
0be6d05
[V1][Metrics] add support for kv event publishing (#16750)
alec-flowers Apr 30, 2025
2990cee
[Feature] The Qwen3 reasoning parser supports guided decoding (#17466)
chaunceyjiang Apr 30, 2025
39317cf
[Docs] Add command for running mypy tests from CI (#17475)
russellb Apr 30, 2025
da4e768
[Fix] Support passing args to logger (#17425)
aarnphm Apr 30, 2025
739e03b
[Bugfix] Fixed mistral tokenizer path when pointing to file (#17457)
psav Apr 30, 2025
947f2f5
[V1] Allow turning off pickle fallback in vllm.v1.serial_utils (#17427)
russellb Apr 30, 2025
0b7e701
[Docs] Update optimization.md doc (#17482)
mgoin Apr 30, 2025
d586ddc
[BugFix] Fix authorization of openai_transcription_client.py (#17321)
hhy3 Apr 30, 2025
584f5fb
[Bugfix][ROCm] Restrict ray version due to a breaking release (#17480)
gshtras Apr 30, 2025
2ac74d0
[doc] add install tips (#17373)
reidliu41 Apr 30, 2025
42d9a2c
doc: fix bug report Github template formatting (#17486)
davidxia Apr 30, 2025
81ecf42
[v1][Spec Decode] Make sliding window compatible with eagle prefix ca…
heheda12345 Apr 30, 2025
200bbf9
Bump Compressed Tensors version to 0.9.4 (#17478)
rahul-tuli Apr 30, 2025
02bd654
[Misc] Rename Audios -> Audio in Qwen2audio Processing (#17507)
alex-jw-brooks May 1, 2025
dbc18e7
[CI][TPU] Skip Multimodal test (#17488)
lsy323 May 1, 2025
08fb558
[Bugfix][ROCm] Fix import error on ROCm (#17495)
gshtras May 1, 2025
1144a8e
[Bugfix] Temporarily disable gptq_bitblas on ROCm (#17411)
nlzy May 1, 2025
17b4d85
[CI][TPU] Skip structured outputs+spec decode tests on TPU (#17510)
mgoin May 1, 2025
aa4502e
[CI][Bugfix] Fix failing V1 Test due to missing 'cache_salt' arg (#17…
mgoin May 1, 2025
afb4429
[CI/Build] Reorganize models tests (#17459)
DarkLight1337 May 1, 2025
7ab643e
FIxing the AMD test failures caused by PR#16457 (#17511)
Alexei-V-Ivanov-AMD May 1, 2025
7a0a146
[Build] Require setuptools >= 77.0.3 for PEP 639 (#17389)
russellb May 1, 2025
90d0a54
[ROCm] Effort to reduce the number of environment variables in comman…
hongxiayang May 1, 2025
13cf6b6
[BugFix] fix speculative decoding memory leak when speculation is dis…
noah-yoshida May 1, 2025
3c3d767
[BugFix] Fix mla cpu - missing 3 required positional arguments (#17494)
LucasWilkinson May 1, 2025
26bc4bb
Avoid overwriting vllm_compile_cache.py (#17418)
youngkent May 1, 2025
fbefc8a
[Core] Enable IPv6 with vllm.utils.make_zmq_socket() (#16506)
russellb May 1, 2025
015069b
[Misc] Optimize the Qwen3_ReasoningParser extract_reasoning_content (…
chaunceyjiang May 1, 2025
a257d9b
Improve configs - `ObservabilityConfig` (#17453)
hmellor May 1, 2025
86a1f67
[Bugfix][Benchmarks] Allow benchmark of deepspeed-mii backend to sele…
tishizaki May 1, 2025
1903c0b
[Frontend] Show progress bar for adding requests (#17525)
DarkLight1337 May 1, 2025
48e925f
[Misc] Clean up test docstrings and names (#17521)
DarkLight1337 May 1, 2025
2007d4d
[FEAT] [ROCm]: Add Qwen/Qwen3-30B-A3B-FP8 fused moe config for MI300X…
tjtanaa May 1, 2025
b74d888
Fix more broken speculative decode tests (#17450)
huydhn May 1, 2025
7169f87
[doc] add streamlit integration (#17522)
reidliu41 May 1, 2025
f5a3c65
[FEAT] [ROCm]: Add Qwen/Qwen3-235B-A22B-FP8 TP4 triton fused moe conf…
tjtanaa May 1, 2025
98060b0
[Feature][Frontend]: Deprecate --enable-reasoning (#17452)
chaunceyjiang May 1, 2025
28566d7
[ROCm] remove unsupported archs from rocm triton flash-attention supp…
hongxiayang May 1, 2025
460a2b1
[torch.compile] Add torch inductor pass for fusing silu_and_mul with …
SageMoore May 1, 2025
7423cf0
[Misc] refactor example - cpu_offload_lmcache (#17460)
reidliu41 May 1, 2025
f2e7af9
[CI/Build] Remove `awscli` dependency (#17532)
DarkLight1337 May 1, 2025
6768ff4
Move the last arguments in `arg_utils.py` to be in their final groups…
hmellor May 1, 2025
88c8304
[Model] Refactor Ovis2 to support original tokenizer (#17537)
Isotr0py May 1, 2025
4acfa33
[ROCm] update installation guide to include build aiter from source i…
hongxiayang May 1, 2025
61c299f
[Misc]add configurable cuda graph size (#17201)
CXIAAAAA May 1, 2025
9b1769d
[Bugfix] Fix lint error (#17547)
DarkLight1337 May 1, 2025
811a6c0
[ROCM] Add gfx950 to the custom attention archs (#16034)
jpvillam-amd May 1, 2025
04f2cfc
Remove duplicate code from dbrx.py (#17550)
sstamenk May 1, 2025
173daac
[Bug]change the position of cuda_graph_sizes in dataclasses (#17548)
CXIAAAAA May 1, 2025
9b70e2b
[Misc][Tools][Benchmark] Publish script to auto tune server parameter…
Chenyaaang May 1, 2025
39c0813
[V1][Spec Decode] Apply torch.compile & cudagraph to EAGLE3 (#17504)
zixi-qi May 1, 2025
24aebae
[Bugfix] Disable gptq_bitblas for <SM80 to fix GPTQ on V100/T4 (#17541)
mgoin May 2, 2025
afb12e4
[Doc] note that not all unit tests pass on CPU platforms (#17554)
davidxia May 2, 2025
afcb3f8
[Attention] MLA move o_proj q_proj into cuda-graph region (#17484)
LucasWilkinson May 2, 2025
292fc59
[CI] Actually run tests/kv_transfer/test_disagg.py in CI (#17555)
mgoin May 2, 2025
b4003d1
Check if bitblas is installed during support check (#17572)
mgoin May 2, 2025
f89d0e1
[Misc] Continue refactoring model tests (#17573)
DarkLight1337 May 2, 2025
f192ca9
Fix PixtralHF missing spatial_merge_size (#17571)
mgoin May 2, 2025
109e15a
Add `pt_load_map_location` to allow loading to cuda (#16869)
jerryzh168 May 2, 2025
9e2de9b
[Bugifx] Remove TritonPlaceholder from sys.modules (#17317)
Isotr0py May 2, 2025
cc2a77d
[Core] [Bugfix] Add Input Embeddings (#15428)
qthequartermasterman May 2, 2025
c777df7
[BugFix] Fix Memory Leak (#17567)
robertgshaw2-redhat May 2, 2025
d754386
[Misc] Rename assets for testing (#17575)
DarkLight1337 May 2, 2025
b8b0859
add more pytorch related tests for torch nightly (#17422)
yangw-dev May 2, 2025
6d1479c
[doc] add the print result (#17584)
reidliu41 May 2, 2025
785d75a
Automatically tell users that dict args must be valid JSON in CLI (#1…
hmellor May 2, 2025
99404f5
[Security] Fix image hash collision (#17378)
DarkLight1337 May 2, 2025
868c546
Support W8A8 INT8 MoE for compressed-tensors (#16745)
mgoin May 2, 2025
3a500cd
[doc] miss result (#17589)
reidliu41 May 2, 2025
cb23495
[Misc] Clean up input processing (#17582)
DarkLight1337 May 2, 2025
4c33d67
[Bugfix] fix tmp_out and exp_sums dimensions (#17438)
hliuca May 2, 2025
0f87d8f
[BugFix][Attention] Fix sliding window attention in V1 giving incorre…
LucasWilkinson May 2, 2025
3e887d2
permute/unpermute kernel for moe optimization (#14568)
CalebDu May 2, 2025
182f40e
Add NVIDIA TensorRT Model Optimizer in vLLM documentation (#17561)
Edwardf0t1 May 2, 2025
9352cdb
[Hardware][AMD] Improve OAM device ID + llama4 Maverick MOE tuning (#…
xw285cornell May 2, 2025
b90b085
[easy] Print number of needed GPUs in skip message (#17594)
zou3519 May 2, 2025
9b103a1
fix typo in logging (#17605)
ehartford May 3, 2025
3ec97e2
[release] Add command to clean up Docker containers/images in TPU rel…
khluu May 3, 2025
22c6f63
[Neuron][Build] Require setuptools >= 77.0.3 for PEP 639 (#17603)
liangfu May 3, 2025
d47b605
Update test requirements to CUDA 12.8 (#17576)
22quinn May 3, 2025
e3d0a1d
[Quantizaton] [AMD] Add support for running DeepSeek int8 w8a8 MoE on…
rasmith May 3, 2025
87baebe
[Frontend][TPU] Add TPU default max-num-batched-tokens based on devic…
Chenyaaang May 3, 2025
c8386fa
[Build/CI] Upgrade CUTLASS to 3.9.1 (#17602)
tlrmchlsmth May 3, 2025
a928424
[Bugfix][ROCm] Using device_type because on ROCm the API is still tor…
gshtras May 3, 2025
887d7af
[Core] Gate `prompt_embeds` behind a feature flag (#17607)
DarkLight1337 May 3, 2025
f66f1e0
[Bugfix] Fix broken Qwen2.5-omni tests (#17613)
Isotr0py May 3, 2025
46fae69
[Misc] V0 fallback for `--enable-prompt-embeds` (#17615)
DarkLight1337 May 3, 2025
d6484ef
Add full API docs and improve the UX of navigating them (#17485)
hmellor May 4, 2025
2858830
[Bugfix] Prioritize dtype in root config before checking text config …
DarkLight1337 May 4, 2025
68e1ee0
[Bugfix][Easy] Fix whitespace in shm_broadcast.py logging (#17635)
tlrmchlsmth May 5, 2025
5394ad7
[Bugfix] fix KeyError on top logprobs are special tokens (#17637)
chaunceyjiang May 5, 2025
f62cad6
[Build/CI] Upgrade CUTLASS to 3.9.2 (#17641)
tlrmchlsmth May 5, 2025
1d0c9d6
[Kernel] some optimizations for dense marlin and moe marlin (#16850)
jinzhen-lin May 5, 2025
cc05b90
[Doc] Fix broken cuda installation doc rendering (#17654)
Isotr0py May 5, 2025
aea302b
Use git-path commit in hook (#17616)
thomasjpfan May 5, 2025
d3efde8
[Benchmarks] Remove invalid option under V1 engine (#17651)
russellb May 5, 2025
5ea5c51
[BugFix] Increase timeout for startup failure test (#17642)
njhill May 5, 2025
9765940
[TPU] Enable gemma3-27b with TP>1 on multi-chips. (#17335)
vanbasten23 May 5, 2025
5941e0b
[TPU][V1] Add support for top-logprobs (#17072)
NickLucche May 5, 2025
90bd2ae
[Bugfix] LoRA - Retire unused maxnreg LoRA kernel argument (#17677)
varun-sundar-rabindranath May 6, 2025
98834fe
Update nm to rht in doc links + refine fp8 doc (#17678)
mgoin May 6, 2025
999328b
[Model] Add GraniteMoeHybrid 4.0 model (#17497)
s3woz May 6, 2025
edbf2d6
[easy] Fix logspam on PiecewiseBackend errors (#17138)
zou3519 May 6, 2025
dc47ba3
[Bugfix] Fixed prompt length for random dataset (#17408)
Xarbirus May 6, 2025
63ced7b
[Doc] Update notes for H2O-VL and Gemma3 (#17219)
DarkLight1337 May 6, 2025
6eae345
[Misc] Fix ScalarType float4 naming (#17690)
LucasWilkinson May 6, 2025
05e1f96
Fix `dockerfilegraph` pre-commit hook (#17698)
hmellor May 6, 2025
f9bc5a0
[Bugfix] Fix triton import with local TritonPlaceholder (#17446)
MengqingCao May 6, 2025
d419aa5
[V1] Enable TPU V1 backend by default (#17673)
mgoin May 6, 2025
a6fed02
[V1][PP] Support PP for MultiprocExecutor (#14219)
bigPYJ1151 May 6, 2025
cba31c4
[v1] AttentionMetadata for each layer (#17394)
heheda12345 May 6, 2025
175bda6
[Feat] Add deprecated=True to CLI args (#17426)
aarnphm May 6, 2025
0d11546
[Docs] Use gh-file to add links to tool_calling.md (#17709)
windsonsea May 6, 2025
aabcd2c
[v1] Introduce KVCacheBlocks as interface between Scheduler and KVCac…
heheda12345 May 6, 2025
7525d5f
[doc] Add RAG Integration example (#17692)
reidliu41 May 6, 2025
5b8c390
[Bugfix] Fix modality limits in vision language example (#17721)
DarkLight1337 May 6, 2025
6115b11
Make right sidebar more readable in "Supported Models" (#17723)
hmellor May 6, 2025
621ca2c
[TPU] Increase block size and reset block shapes (#16458)
bythew3i May 6, 2025
d456aea
[Misc] Add Next Edit Prediction (NEP) datasets support in `benchmark_…
dtransposed May 6, 2025
de906b9
[Bugfix] Fix for the condition to accept empty encoder inputs for mll…
gshtras May 6, 2025
2f925e5
[Kernel] Unified Triton kernel that doesn't distinguish between prefi…
tdoublep May 6, 2025
022afbe
Fix doc build performance (#17748)
hmellor May 7, 2025
ed3a1d2
[ROCm] fix num_stages for default moe config to avoid triton OutOfRes…
hongxiayang May 7, 2025
6de3e13
Add logging for torch nightly version (#17669)
yangw-dev May 7, 2025
18dd5e0
[Model] Mamba2 causal conv1d Refactor to Split Prefill and Decode Req…
cyang49 May 7, 2025
a17cef7
Removed unused marlin cuda code (#17684)
mgoin May 7, 2025
e50a1f1
[TPU] Add kernel test for moe_pallas (#17496)
mgoin May 7, 2025
950b711
Replace lm-eval bash script with pytest and use enforce_eager for fas…
mgoin May 7, 2025
8d84d83
[BugFix][Spec Decode] Fix hidden size mismatch between target and eag…
WoosukKwon May 7, 2025
822de7f
[Misc] Split model loader (#17712)
jeejeelee May 7, 2025
c3e9d50
[Misc] Use `apply_rotary_emb` from vllm_flash_attn for Qwen2-VL visio…
Isotr0py May 7, 2025
1a45a61
[Kernel] GGUF MoeVec kernel (#16780)
SzymonOzog May 7, 2025
f80ae5b
[Kernel] Use fused rmsnorm for some models like qwen3 series (#17735)
Eviannn May 7, 2025
ba7703e
[Misc] Remove qlora_adapter_name_or_path (#17699)
jeejeelee May 7, 2025
043e4c4
Add NeuronxDistributedInference support, Speculative Decoding, Dynami…
aws-satyajith May 7, 2025
8a15c26
[Frontend] Add missing chat templates for various MLLMs (#17758)
DarkLight1337 May 7, 2025
324a311
Fix test_memory_usage_no_spec (#17754)
sarckk May 7, 2025
98c89e1
Make key optional for rotary embedding (#17566)
sarckk May 7, 2025
7377dd0
[doc] update the issue link (#17782)
reidliu41 May 7, 2025
32aa74c
[ROCm][FP8][Kernel] FP8 quantization fused into Custom Paged Attentio…
gshtras May 7, 2025
1a6af14
Only depend on importlib-metadata for Python < 3.10 (#17776)
tiran May 7, 2025
be8ff88
[Bugfix] Fix Video IO error for short video (#17791)
Isotr0py May 7, 2025
646a31e
Fix and simplify `deprecated=True` CLI `kwarg` (#17781)
hmellor May 7, 2025
f98e307
[Bugfix] Fix missing lora name mapping for lora without prefix (#17793)
Isotr0py May 7, 2025
db593aa
[Quantization] Quark MXFP4 format loading (#16943)
BowenBao May 7, 2025
c20ef40
[Hardware][TPU][V1] Multi-LoRA implementation for the V1 TPU backend …
Akshat-Tripathi May 7, 2025
ed5272c
[BugFix] Avoid secondary missing `MultiprocExecutor.workers` error (#…
njhill May 7, 2025
d43f914
[Core][Feature] Input metadata dump on crash (#13407)
wallashss May 7, 2025
a8238bb
[Chore][Doc] uses model id determined from OpenAI client (#17815)
aarnphm May 8, 2025
66ab3b1
Don't call the venv `vllm` (#17810)
hmellor May 8, 2025
3d13ca0
[BugFix] Fix `--disable-log-stats` in V1 server mode (#17600)
njhill May 8, 2025
7ea2adb
[Core] Support full cuda graph in v1 (#16072)
chanh May 8, 2025
b2da14a
Improve exception reporting in MP engine (#17800)
vmarkovtsev May 8, 2025
c747d84
[Installation] OpenTelemetry version update (#17771)
Xarbirus May 8, 2025
998eea4
Only log non-default CLI args for online serving (#17803)
hmellor May 8, 2025
6930a41
[V1] Add VLLM_ALLOW_INSECURE_SERIALIZATION env var (#17490)
russellb May 8, 2025
5a499e7
[Kernel][Hardware][AMD] Bf16 mfma opt for ROCm skinny GEMMs (#17071)
amd-hhashemi May 8, 2025
e515668
[Hardware][Power] Enable compressed tensor W8A8 INT8 quantization for…
Akashcodes732 May 8, 2025
843b222
[Hardware][Intel-Gaudi] Support Automatic Prefix Caching on HPU (#17648)
adobrzyn May 8, 2025
96722aa
[Frontend] Chat template fallbacks for multimodal models (#17805)
DarkLight1337 May 8, 2025
597051e
[Qwen3]add qwen3-235b-bf16 fused moe config on A100 (#17715)
Ximingwang-09 May 8, 2025
39956ef
[Bugfix] Fix bad words for Mistral models (#17753)
qionghuang6 May 8, 2025
0a9bbaa
[Misc] support model prefix & add deepseek vl2 tiny fused moe config …
xsank May 8, 2025
ca04b97
[Bugfix] Fix tool call template validation for Mistral models (#17644)
RIckYuan999 May 8, 2025
a463555
[TPU] Fix the test_sampler (#17820)
bythew3i May 8, 2025
bb239a7
[Bugfix] Fix quark fp8 format loading on AMD GPUs (#12612)
fxmarty-amd May 8, 2025
a1e19b6
[Doc] Fix a typo in the file name (#17836)
DarkLight1337 May 8, 2025
f50dcb7
[Easy] Eliminate c10::optional usage in vllm/csrc (#17819)
houseroad May 8, 2025
53d0cb7
[Misc] add chatbox integration (#17828)
reidliu41 May 8, 2025
e4ca6e3
Fix transient dependency error in docs build (#17848)
hmellor May 8, 2025
015815f
[Bugfix] `use_fast` failing to be propagated to Qwen2-VL image proces…
DarkLight1337 May 8, 2025
a944f8e
[Misc] Delete LoRA-related redundancy code (#17841)
jeejeelee May 8, 2025
ec54d73
[CI] Fix test_collective_rpc (#17858)
russellb May 8, 2025
226a427
[V1] Improve VLLM_ALLOW_INSECURE_SERIALIZATION logging (#17860)
russellb May 8, 2025
a83a0f9
[Test] Attempt all TPU V1 tests, even if some of them fail. (#17334)
yarongmu-google May 8, 2025
@@ -1,3 +1,4 @@
+# For vllm script, with -t option (tensor parallel size).
 # bash ./run-lm-eval-gsm-vllm-baseline.sh -m deepseek-ai/DeepSeek-V2-Lite-Chat -b "auto" -l 1000 -f 5 -t 2
 model_name: "deepseek-ai/DeepSeek-V2-Lite-Chat"
 tasks:
@@ -1,3 +1,4 @@
+# For hf script, without -t option (tensor parallel size).
 # bash .buildkite/lm-eval-harness/run-lm-eval-gsm-hf-baseline.sh -m nm-testing/Meta-Llama-3-70B-Instruct-FBGEMM-nonuniform -b auto -l 1000 -f 5
 model_name: "nm-testing/Meta-Llama-3-70B-Instruct-FBGEMM-nonuniform"
 tasks:
@@ -1,3 +1,4 @@
+# For hf script, without -t option (tensor parallel size).
 # bash .buildkite/lm-eval-harness/run-lm-eval-gsm-hf-baseline.sh -m meta-llama/Meta-Llama-3-70B-Instruct -b 32 -l 250 -f 5
 model_name: "meta-llama/Meta-Llama-3-70B-Instruct"
 tasks:
@@ -1,3 +1,4 @@
+# For vllm script, with -t option (tensor parallel size).
 # bash .buildkite/lm-eval-harness/run-lm-eval-gsm-vllm-baseline.sh -m nm-testing/Meta-Llama-3-8B-Instruct-W8A8-FP8-Channelwise-compressed-tensors -b auto -l 1000 -f 5 -t 1
 model_name: "nm-testing/Meta-Llama-3-8B-Instruct-W8A8-FP8-Channelwise-compressed-tensors"
 tasks:
@@ -1,3 +1,4 @@
+# For vllm script, with -t option (tensor parallel size).
 # bash .buildkite/lm-eval-harness/run-lm-eval-gsm-vllm-baseline.sh -m nm-testing/Meta-Llama-3-8B-Instruct-FBGEMM-nonuniform -b auto -l 1000 -f 5 -t 1
 model_name: "nm-testing/Meta-Llama-3-8B-Instruct-FBGEMM-nonuniform"
 tasks:
@@ -1,3 +1,4 @@
+# For vllm script, with -t option (tensor parallel size).
 # bash .buildkite/lm-eval-harness/run-lm-eval-gsm-vllm-baseline.sh -m nm-testing/Meta-Llama-3-8B-FP8-compressed-tensors-test -b 32 -l 1000 -f 5 -t 1
 model_name: "nm-testing/Meta-Llama-3-8B-FP8-compressed-tensors-test"
 tasks:
@@ -1,3 +1,4 @@
+# For vllm script, with -t option (tensor parallel size).
 # bash .buildkite/lm-eval-harness/run-lm-eval-gsm-vllm-baseline.sh -m neuralmagic/Meta-Llama-3-8B-Instruct-FP8 -b 32 -l 250 -f 5 -t 1
 model_name: "neuralmagic/Meta-Llama-3-8B-Instruct-FP8"
 tasks:
@@ -1,3 +1,4 @@
+# For vllm script, with -t option (tensor parallel size).
 # bash .buildkite/lm-eval-harness/run-lm-eval-gsm-vllm-baseline.sh -m nm-testing/Meta-Llama-3-8B-Instruct-W8-Channel-A8-Dynamic-Asym-Per-Token-Test -b "auto" -l 250 -f 5 -t 1
 model_name: "nm-testing/Meta-Llama-3-8B-Instruct-W8-Channel-A8-Dynamic-Asym-Per-Token-Test"
 tasks:
@@ -1,3 +1,4 @@
+# For vllm script, with -t option (tensor parallel size).
 # bash .buildkite/lm-eval-harness/run-lm-eval-gsm-vllm-baseline.sh -m nm-testing/Meta-Llama-3-8B-Instruct-W8-Channel-A8-Dynamic-Per-Token-Test -b "auto" -l 250 -f 5 -t 1
 model_name: "nm-testing/Meta-Llama-3-8B-Instruct-W8-Channel-A8-Dynamic-Per-Token-Test"
 tasks:
@@ -1,3 +1,4 @@
+# For vllm script, with -t option (tensor parallel size).
 # bash .buildkite/lm-eval-harness/run-lm-eval-gsm-vllm-baseline.sh -m nm-testing/Meta-Llama-3-8B-Instruct-nonuniform-test -b auto -l 1000 -f 5 -t 1
 model_name: "nm-testing/Meta-Llama-3-8B-Instruct-nonuniform-test"
 tasks:
@@ -1,4 +1,5 @@
-# bash .buildkite/lm-eval-harness/run-lm-eval-gsm-hf-baseline.sh -m meta-llama/Meta-Llama-3-8B-Instruct -b 32 -l 250 -f 5 -t 1
+# For hf script, without -t option (tensor parallel size).
+# bash .buildkite/lm-eval-harness/run-lm-eval-gsm-hf-baseline.sh -m meta-llama/Meta-Llama-3-8B-Instruct -b 32 -l 250 -f 5
 model_name: "meta-llama/Meta-Llama-3-8B-Instruct"
 tasks:
 - name: "gsm8k"
@@ -1,3 +1,4 @@
+# For vllm script, with -t option (tensor parallel size).
 # bash .buildkite/lm-eval-harness/run-lm-eval-gsm-vllm-baseline.sh -m HandH1998/QQQ-Llama-3-8b-g128 -b 32 -l 1000 -f 5 -t 1
 model_name: "HandH1998/QQQ-Llama-3-8b-g128"
 tasks:
@@ -1,3 +1,4 @@
+# For vllm script, with -t option (tensor parallel size).
 # bash .buildkite/lm-eval-harness/run-lm-eval-gsm-vllm-baseline.sh -m neuralmagic/Llama-3.2-1B-Instruct-quantized.w8a8 -b "auto" -l 1000 -f 5 -t 1
 model_name: "neuralmagic/Llama-3.2-1B-Instruct-quantized.w8a8"
 tasks:
5 changes: 3 additions & 2 deletions .buildkite/lm-eval-harness/configs/Minitron-4B-Base-FP8.yaml
@@ -1,11 +1,12 @@
+# For vllm script, with -t option (tensor parallel size).
 # bash .buildkite/lm-eval-harness/run-lm-eval-gsm-vllm-baseline.sh -m mgoin/Minitron-4B-Base-FP8 -b auto -l 1000 -f 5 -t 1
 model_name: "mgoin/Minitron-4B-Base-FP8"
 tasks:
 - name: "gsm8k"
   metrics:
   - name: "exact_match,strict-match"
-    value: 0.233
+    value: 0.231
   - name: "exact_match,flexible-extract"
-    value: 0.236
+    value: 0.22
 limit: 1000
 num_fewshot: 5
@@ -1,3 +1,4 @@
+# For vllm script, with -t option (tensor parallel size).
 # bash ./run-lm-eval-gsm-vllm-baseline.sh -m neuralmagic/Mixtral-8x22B-Instruct-v0.1-FP8-dynamic -b "auto" -l 250 -f 5 -t 8
 model_name: "neuralmagic/Mixtral-8x22B-Instruct-v0.1-FP8-dynamic"
 tasks:
@@ -1,3 +1,4 @@
+# For vllm script, with -t option (tensor parallel size).
 # bash ./run-lm-eval-gsm-vllm-baseline.sh -m neuralmagic/Mixtral-8x7B-Instruct-v0.1-FP8 -b "auto" -l 250 -f 5 -t 4
 model_name: "neuralmagic/Mixtral-8x7B-Instruct-v0.1-FP8"
 tasks:
@@ -1,4 +1,5 @@
-# bash .buildkite/lm-eval-harness/run-lm-eval-gsm-hf-baseline.sh -m neuralmagic/Mixtral-8x7B-Instruct-v0.1 -b 32 -l 250 -f 5 -t 4
+# For hf script, without -t option (tensor parallel size).
+# bash .buildkite/lm-eval-harness/run-lm-eval-gsm-hf-baseline.sh -m neuralmagic/Mixtral-8x7B-Instruct-v0.1 -b 32 -l 250 -f 5
 model_name: "mistralai/Mixtral-8x7B-Instruct-v0.1"
 tasks:
 - name: "gsm8k"
@@ -0,0 +1,12 @@
+# For vllm script, with -t option (tensor parallel size).
+# bash .buildkite/lm-eval-harness/run-lm-eval-gsm-vllm-baseline.sh -m nm-testing/Qwen1.5-MoE-A2.7B-Chat-quantized.w4a16 -b auto -l 1319 -f 5 -t 1
+model_name: "nm-testing/Qwen1.5-MoE-A2.7B-Chat-quantized.w4a16"
+tasks:
+- name: "gsm8k"
+  metrics:
+  - name: "exact_match,strict-match"
+    value: 0.30
+  - name: "exact_match,flexible-extract"
+    value: 0.465
+limit: 1319
+num_fewshot: 5
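To illustrate how a config like this is consumed downstream, here is a minimal sketch (using PyYAML, with the YAML above inlined as a string; the `expected` lookup dict is an illustrative choice, not part of the harness):

```python
import yaml  # PyYAML, the same loader the correctness test uses

config_text = """\
model_name: "nm-testing/Qwen1.5-MoE-A2.7B-Chat-quantized.w4a16"
tasks:
- name: "gsm8k"
  metrics:
  - name: "exact_match,strict-match"
    value: 0.30
  - name: "exact_match,flexible-extract"
    value: 0.465
limit: 1319
num_fewshot: 5
"""

eval_config = yaml.safe_load(config_text)

# Index ground-truth values by (task, metric) for easy lookup.
expected = {(task["name"], metric["name"]): metric["value"]
            for task in eval_config["tasks"]
            for metric in task["metrics"]}

print(expected[("gsm8k", "exact_match,flexible-extract")])  # 0.465
```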
@@ -1,3 +1,4 @@
+# For vllm script, with -t option (tensor parallel size).
 # bash .buildkite/lm-eval-harness/run-lm-eval-gsm-vllm-baseline.sh -m nm-testing/Qwen2-1.5B-Instruct-FP8W8 -b auto -l 1000 -f 5 -t 1
 model_name: "nm-testing/Qwen2-1.5B-Instruct-FP8W8"
 tasks:
@@ -1,3 +1,4 @@
+# For vllm script, with -t option (tensor parallel size).
 # bash .buildkite/lm-eval-harness/run-lm-eval-gsm-vllm-baseline.sh -m neuralmagic/Qwen2-1.5B-Instruct-quantized.w8a8 -b "auto" -l 1000 -f 5 -t 1
 model_name: "neuralmagic/Qwen2-1.5B-Instruct-quantized.w8a8"
 tasks:
@@ -1,3 +1,4 @@
+# For vllm script, with -t option (tensor parallel size).
 # bash .buildkite/lm-eval-harness/run-lm-eval-gsm-vllm-baseline.sh -m nm-testing/Qwen2-1.5B-Instruct-W8A16-Channelwise -b "auto" -l 1000 -f 5 -t 1
 model_name: "nm-testing/Qwen2-1.5B-Instruct-W8A16-Channelwise"
 tasks:
@@ -1,3 +1,4 @@
+# For vllm script, with -t option (tensor parallel size).
 # bash ./run-lm-eval-gsm-vllm-baseline.sh -m Qwen/Qwen2-57B-A14B-Instruct -b "auto" -l 250 -f 5 -t 4
 model_name: "Qwen/Qwen2-57B-A14B-Instruct"
 tasks:
@@ -1,3 +1,4 @@
+# For vllm script, with -t option (tensor parallel size).
 # bash ./run-lm-eval-gsm-vllm-baseline.sh -m nm-testing/SparseLlama-3.1-8B-gsm8k-pruned.2of4-chnl_wts_per_tok_dyn_act_fp8-BitM -b "auto" -t 2
 model_name: "nm-testing/SparseLlama-3.1-8B-gsm8k-pruned.2of4-chnl_wts_per_tok_dyn_act_fp8-BitM"
 tasks:
2 changes: 1 addition & 1 deletion .buildkite/lm-eval-harness/configs/models-small.txt
@@ -4,7 +4,7 @@ Meta-Llama-3.2-1B-Instruct-INT8-compressed-tensors.yaml
 Meta-Llama-3-8B-Instruct-INT8-compressed-tensors-asym.yaml
 Meta-Llama-3-8B-Instruct-nonuniform-compressed-tensors.yaml
 Meta-Llama-3-8B-Instruct-Channelwise-compressed-tensors.yaml
-Minitron-4B-Base-FP8.yaml
+Qwen1.5-MoE-W4A16-compressed-tensors.yaml
 Qwen2-1.5B-Instruct-INT8-compressed-tensors.yaml
 Qwen2-1.5B-Instruct-FP8W8.yaml
 Meta-Llama-3-8B-QQQ.yaml
39 changes: 39 additions & 0 deletions .buildkite/lm-eval-harness/conftest.py
@@ -0,0 +1,39 @@
+# SPDX-License-Identifier: Apache-2.0
+from pathlib import Path
+
+import pytest
+
+
+def pytest_addoption(parser):
+    parser.addoption(
+        "--config-list-file",
+        action="store",
+        help="Path to the file listing model config YAMLs (one per line)")
+    parser.addoption("--tp-size",
+                     action="store",
+                     default="1",
+                     help="Tensor parallel size to use for evaluation")
+
+
+@pytest.fixture(scope="session")
+def config_list_file(pytestconfig, config_dir):
+    rel_path = pytestconfig.getoption("--config-list-file")
+    return config_dir / rel_path
+
+
+@pytest.fixture(scope="session")
+def tp_size(pytestconfig):
+    return pytestconfig.getoption("--tp-size")
+
+
+def pytest_generate_tests(metafunc):
+    if "config_filename" in metafunc.fixturenames:
+        rel_path = metafunc.config.getoption("--config-list-file")
+        config_list_file = Path(rel_path).resolve()
+        config_dir = config_list_file.parent
+        with open(config_list_file, encoding="utf-8") as f:
+            configs = [
+                config_dir / line.strip() for line in f
+                if line.strip() and not line.startswith("#")
+            ]
+        metafunc.parametrize("config_filename", configs)
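The list-file handling in `pytest_generate_tests` above can be exercised standalone. A small sketch (the temp-file contents are hypothetical) showing that blank lines and `#` comment lines are skipped and entries are resolved relative to the list file's directory:

```python
import tempfile
from pathlib import Path


def read_config_list(config_list_file: Path) -> list[Path]:
    # Mirrors the collection logic above: one YAML filename per line,
    # blanks and '#' comment lines ignored, paths resolved next to the list file.
    config_dir = config_list_file.resolve().parent
    with open(config_list_file, encoding="utf-8") as f:
        return [
            config_dir / line.strip() for line in f
            if line.strip() and not line.startswith("#")
        ]


with tempfile.TemporaryDirectory() as d:
    lst = Path(d) / "models-small.txt"
    lst.write_text("# a comment\n\nMinitron-4B-Base-FP8.yaml\nMeta-Llama-3-8B-QQQ.yaml\n")
    names = [p.name for p in read_config_list(lst)]

print(names)  # ['Minitron-4B-Base-FP8.yaml', 'Meta-Llama-3-8B-QQQ.yaml']
```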
59 changes: 0 additions & 59 deletions .buildkite/lm-eval-harness/run-tests.sh

This file was deleted.

38 changes: 12 additions & 26 deletions .buildkite/lm-eval-harness/test_lm_eval_correctness.py
@@ -3,62 +3,48 @@
 LM eval harness on model to compare vs HF baseline computed offline.
 Configs are found in configs/$MODEL.yaml
 
-* export LM_EVAL_TEST_DATA_FILE=configs/Meta-Llama-3-70B-Instruct.yaml
-* export LM_EVAL_TP_SIZE=4
-* pytest -s test_lm_eval_correctness.py
+pytest -s -v test_lm_eval_correctness.py \
+    --config-list-file=configs/models-small.txt \
+    --tp-size=1
 """
 
-import os
-from pathlib import Path
-
 import lm_eval
-import numpy
+import numpy as np
 import yaml
 
-RTOL = 0.05
-TEST_DATA_FILE = os.environ.get(
-    "LM_EVAL_TEST_DATA_FILE",
-    ".buildkite/lm-eval-harness/configs/Meta-Llama-3-8B-Instruct.yaml")
-
-TP_SIZE = os.environ.get("LM_EVAL_TP_SIZE", 1)
+RTOL = 0.08
 
 
-def launch_lm_eval(eval_config):
+def launch_lm_eval(eval_config, tp_size):
     trust_remote_code = eval_config.get('trust_remote_code', False)
 
     model_args = f"pretrained={eval_config['model_name']}," \
-                 f"tensor_parallel_size={TP_SIZE}," \
+                 f"tensor_parallel_size={tp_size}," \
                  f"enforce_eager=true," \
                  f"add_bos_token=true," \
                  f"trust_remote_code={trust_remote_code}"
 
     results = lm_eval.simple_evaluate(
         model="vllm",
         model_args=model_args,
         tasks=[task["name"] for task in eval_config["tasks"]],
         num_fewshot=eval_config["num_fewshot"],
         limit=eval_config["limit"],
         batch_size="auto")
 
     return results
 
 
-def test_lm_eval_correctness():
-    eval_config = yaml.safe_load(
-        Path(TEST_DATA_FILE).read_text(encoding="utf-8"))
+def test_lm_eval_correctness_param(config_filename, tp_size):
+    eval_config = yaml.safe_load(config_filename.read_text(encoding="utf-8"))
 
     # Launch eval requests.
-    results = launch_lm_eval(eval_config)
+    results = launch_lm_eval(eval_config, tp_size)
 
     # Confirm scores match ground truth.
     success = True
     for task in eval_config["tasks"]:
         for metric in task["metrics"]:
             ground_truth = metric["value"]
             measured_value = results["results"][task["name"]][metric["name"]]
             print(f'{task["name"]} | {metric["name"]}: '
                   f'ground_truth={ground_truth} | measured={measured_value}')
-            success = success and numpy.isclose(
+            success = success and np.isclose(
                 ground_truth, measured_value, rtol=RTOL)
 
     # Assert at the end, print all scores even on failure for debugging.
     assert success
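For reference, the pass/fail check relies on `numpy.isclose`, which by default tests `|a - b| <= atol + rtol * |b|`, so the tolerance band scales with the second argument. A small sketch with the harness-style call order (ground truth first, measured value second) and the 0.08 relative tolerance; the helper name is illustrative, not from the harness:

```python
import numpy as np

RTOL = 0.08  # relative tolerance used by the correctness test


def score_matches(ground_truth: float, measured: float) -> bool:
    # np.isclose(a, b, rtol=RTOL) checks |a - b| <= atol + RTOL * |b|,
    # i.e. the tolerance here scales with the measured value.
    return bool(np.isclose(ground_truth, measured, rtol=RTOL))


print(score_matches(0.231, 0.245))  # |0.231-0.245| = 0.014 <= 0.08*0.245 -> True
print(score_matches(0.231, 0.260))  # |0.231-0.260| = 0.029 >  0.08*0.260 -> False
```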