Merged
Changes from all commits
416 commits
01c2233
[Kernel] [V1] Fix performance regression for triton unified attention…
tdoublep May 15, 2025
566ec04
Adding "Basic Models Test" and "Multi-Modal Models Test (Extended) 3"…
Alexei-V-Ivanov-AMD May 15, 2025
51ff154
Improve examples rendering in docs and GitHub (#18203)
hmellor May 15, 2025
2aa5470
[Frontend] Fix chat template content format detection (#18190)
schoennenbeck May 15, 2025
fadb8d5
[Bugfix] Change the exception thrown by call_hf_processor from Runtime…
Abatom May 15, 2025
9254052
[Bugfix] [ROCm]: Remove assertion logic when using AITER fused moe in…
tjtanaa May 15, 2025
e3f3aee
[Misc] Avoid cuda graph log when sizes still match (#18202)
NickLucche May 15, 2025
0b34593
Adding "AMD: Tensorizer Test" to amdproduction. (#18216)
Alexei-V-Ivanov-AMD May 15, 2025
8795eb9
[Bugfix] Fix test_eagle test (#18223)
luccafong May 15, 2025
c7852a6
[Build] Allow shipping PTX on a per-file basis (#18155)
LucasWilkinson May 15, 2025
4e1c6a0
[Bugfix] fix rotary embedding test for _get_padded_tensor_shape (#18229)
LucasWilkinson May 16, 2025
ee659e3
[Bugfix][ROCm] Use `chunked_prefill_paged_decode` as fallback for V1 …
kliuae May 16, 2025
f4937a5
[Model] vLLM v1 supports Medusa (#17956)
skylee-01 May 16, 2025
b18201f
Allow users to pass arbitrary JSON keys from CLI (#18208)
hmellor May 16, 2025
6b31c84
Throw better error when running into k8s service discovery issue …
wseaton May 16, 2025
3d2779c
[Feature] Support Pipeline Parallelism in torchrun SPMD offline inferen…
luccafong May 16, 2025
5c04bb8
[doc] fix multimodal example script (#18089)
davidxia May 16, 2025
67da572
[PERF] Speed up Qwen2.5-VL model by speed up rotary position embeddin…
vadiklyutiy May 16, 2025
5418176
[Misc] Add Ray Prometheus logger to V1 (#17925)
eicherseiji May 16, 2025
390ec88
[Misc] Consolidate Audio tests into multimodal common generation test…
Isotr0py May 16, 2025
e23564c
use ceil_div in cutlass block scaling shape check (#17918)
IwakuraRein May 16, 2025
a5f8c11
[Fix] Fix typo in `resolve_hf_chat_template` (#18259)
fxmarty-amd May 16, 2025
87d8714
[Model] Use autoweightloader for dbrx (#18251)
learner0810 May 16, 2025
d3d91b6
[Misc][MacOS] fix bfloat16 error (#18249)
reidliu41 May 16, 2025
1db4f47
[BugFix] Fix multi async save in MultiConnector (#18246)
njhill May 16, 2025
0ceaebf
[BugFix] Fix ordering of KVConnector finished send/rcv sets (#18211)
njhill May 16, 2025
aef94c6
[CI] Assign reviewer to mergify with changes to Tensorizer files (#18…
sangstar May 16, 2025
7fdfa01
[Sampler] Adapt to FlashInfer 0.2.3 sampler API (#15777)
abmfy May 16, 2025
e73b7df
[Bugfix] fix `an illegal memory access was encountered` of marlin ker…
jinzhen-lin May 16, 2025
fabe89b
[Spec Decode] Don't fall back to V0 when spec decoding is enabled (#1…
WoosukKwon May 16, 2025
fd195b1
[V1][P/D] Local attention optimization for NIXL (#18170)
mgoin May 17, 2025
55f1a46
Move cli args docs to its own page (#18228) (#18264)
strangiato May 17, 2025
60017dc
[Misc] reformat the collect-env output (#18285)
reidliu41 May 17, 2025
4ee4826
[BugFix] Correct max_model_len derivation from config.json for Mistra…
princepride May 17, 2025
3e0d435
[P/D][V1] Support dynamic loading of external KV connector implementa…
sdavidbd May 17, 2025
48ac2be
[Hardware][TPU] Optionally import for TPU backend (#18269)
lsy323 May 17, 2025
dcfe952
Update Dockerfile to build for Blackwell (#18095)
mgoin May 17, 2025
f880d42
Fixed build on ppc64le due to openssl conflicts (#18262)
npanpaliya May 17, 2025
9214e60
[Model] use AutoWeightsLoader for solar (#18113)
lengrongfu May 17, 2025
66e63e8
[MISC] fix typo (#18305)
andyxning May 17, 2025
9ab2c02
Support sequence parallelism combined with pipeline parallelism (#18243)
cascade812 May 17, 2025
1a8f68b
[doc] update reasoning doc (#18306)
reidliu41 May 18, 2025
908733a
[Model] Use sigmoid for single-label classification (#18313)
22quinn May 18, 2025
4fb349f
Fix copy-paste error in phi4mm image processing (#18315)
lifuhuang May 18, 2025
b6a6e7a
[Misc] add litellm integration (#18320)
reidliu41 May 18, 2025
d1211f8
[Doc] Add doc to explain the usage of Qwen3 thinking (#18291)
WangErXiao May 18, 2025
9da1095
[Spec Decode][V0] Fix spec decode correctness test in V0 eagle/medusa…
wwl2755 May 19, 2025
221cfc2
Feature/vllm/input embedding completion api (#17590)
Nan2018 May 19, 2025
27d0952
[Misc] extract parser.parse_args() (#18323)
reidliu41 May 19, 2025
47fda6d
[Build] Supports CUDA 12.6 and 11.8 after Blackwell Update (#18316)
simon-mo May 19, 2025
275c5da
fix: Add type specifications for CLI arguments in tensorizer options …
googs1025 May 19, 2025
d637b96
[BugFix] [Vul] Add missing `usedforsecurity=False` in MD5 hashing to …
shaoyuyoung May 19, 2025
c5bb0eb
[Doc] Fix prompt embedding examples (#18350)
Potabk May 19, 2025
43b5f61
[Doc] Move input-related docs to Features (#18353)
DarkLight1337 May 19, 2025
1b15df2
[BugFix] Fix handling of num_computed_tokens with connector (#18232)
njhill May 19, 2025
6781af5
[Quantization] Pool model support bitsandbytes (#18087)
jeejeelee May 19, 2025
84ab4fe
[Doc] Fix typo (#18355)
eladsegal May 19, 2025
20d8ce8
[Frontend] add --quick option for vllm chat/complete (#18297)
reidliu41 May 19, 2025
e2ee1e8
[Feature]Add support for models quantized with AutoRound (#17850)
wenhuach21 May 19, 2025
7937c2f
Add fused MoE kernel tuning configs (fp8_w8a8) fo…
codesun8 May 19, 2025
8171221
[Misc] Fix typo (#18330)
Unprincess17 May 19, 2025
dc1440c
Neuron up mistral (#18222)
aws-satyajith May 19, 2025
258bf62
fix CUDA_check redefinition in #17918 (#18287)
luccafong May 19, 2025
d565e09
[neuron] fix authorization issue (#18364)
liangfu May 19, 2025
f07a673
[Misc] Allow `AutoWeightsLoader` to skip loading weights with specifi…
Isotr0py May 20, 2025
9609327
[Core] [Bugfix]: tensor parallel with prompt embeds (#18171)
Nan2018 May 20, 2025
d981396
[release] Change dockerhub username for TPU release (#18389)
khluu May 20, 2025
bca55b5
[Bugfix] fix adding bias twice in ipex GPTQ quantization (#18363)
rand-fly May 20, 2025
1b1e8e0
[doc] update env variable export (#18391)
reidliu41 May 20, 2025
6b35cb1
[Misc] Add LoRA code owner (#18387)
jeejeelee May 20, 2025
d6c86d0
Update cpu.txt (#18398)
princepride May 20, 2025
8684770
[CI] Add mteb testing to test the accuracy of the embedding model (#1…
noooop May 20, 2025
be48360
[Bugfix] Fix MRoPE Errors in the Qwen-VL Model When Processing Pure T…
wulipc May 20, 2025
8f55962
[Misc] refactor prompt embedding examples (#18405)
reidliu41 May 20, 2025
f4a8a37
[Minor] Rename quantization nvfp4 to modelopt_fp4 (#18356)
mgoin May 20, 2025
e1f5a71
[Model] use AutoWeightsLoader for bloom (#18300)
calvin0327 May 20, 2025
980a172
[Kernel] update comment for KV shape in unified triton attn (#18099)
haochengxia May 20, 2025
23baa21
fix:Build torch wheel inline rather than picking from nightly (#18351)
dilipgb May 20, 2025
3b17ea2
[TPU] Re-enable the Pallas MoE kernel (#18025)
mgoin May 21, 2025
0c15c2e
[Bugfix] config.head_dim is now explicitly set to None (#18432)
gshtras May 21, 2025
92247c5
[Bug] Fix moe_sum signature (#18440)
bnellnm May 21, 2025
ad0012a
Revert "[Bugfix] Fix MRoPE Errors in the Qwen-VL Model When Processin…
DarkLight1337 May 21, 2025
d06dd72
[Bugfix][Failing Test] Fix nixl connector test when promt size < bloc…
wwl2755 May 21, 2025
cd8dfc6
[Misc] MultiConnector._connectors type (#18423)
NickLucche May 21, 2025
5d7f545
[Frontend] deprecate `--device` arg (#18399)
kebe7jun May 21, 2025
907f935
[V1] Fix general plugins not loaded in engine for multiproc (#18326)
sarckk May 21, 2025
107f5fc
[Misc] refactor disaggregated-prefill-v1 example (#18474)
reidliu41 May 21, 2025
61acfc4
[Bugfix][Failing Test] Fix test_events.py (#18460)
rabi May 21, 2025
eca1869
[MODEL] FalconH1 (#18406)
dhiaEddineRhaiem May 21, 2025
c154d89
[Doc] fix arg docstring in linear layers (#18410)
giantcroc May 21, 2025
c6c10ca
[Bugfix] Reduce moe_sum test size to avoid OOM (#18484)
bnellnm May 21, 2025
371376f
[Build] fix Dockerfile shell (#18402)
kebe7jun May 21, 2025
2b16104
[Misc] Update deprecation message for `--enable-reasoning` (#18404)
Zerohertz May 21, 2025
dd5fa7e
[ROCm][Kernel][V1] Enable AMD Radeon GPU Custom Paged Attention on v1…
hyoon1 May 21, 2025
bb0a311
Revert "[v1] Support multiple KV cache groups in GPU model runner (#1…
markmc May 21, 2025
94d8ec8
[FEAT][ROCm] Upgrade AITER MLA v1 backend (#18338)
vllmellm May 21, 2025
1f07954
[Bugfix] Consistent ascii handling in tool parsers (#17704)
schoennenbeck May 21, 2025
20bd6f4
[FalconH1] Fix output dtype in RMSNorm fallback path for Falcon-H1 (e…
dhiaEddineRhaiem May 22, 2025
176d62e
[MISC] update project urls in pyproject.toml (#18519)
andyxning May 22, 2025
6e0fd34
[CI] Fix race condition with StatelessProcessGroup.barrier (#18506)
russellb May 22, 2025
acb54ca
Initialize io_thread_pool attribute in the beginning. (#18331)
rabi May 22, 2025
d022115
[Bugfix] Inconsistent token calculation compared to HF in llava famil…
cyr0930 May 22, 2025
cf5984b
[BugFix][DP] Send DP wave completion only from `dp_rank==0` (#18502)
njhill May 22, 2025
5179777
[Bugfix][Model] Make Olmo2Model weight loading return loaded weights …
2015aroras May 22, 2025
db5a29b
[Bugfix] Fix LoRA test (#18518)
jeejeelee May 22, 2025
23b67b3
[Doc] Fix invalid JSON in example args (#18527)
DarkLight1337 May 22, 2025
e2d7d31
[Neuron] Update Dockerfile.neuron to use latest neuron release (2.23)…
aws-satyajith May 22, 2025
ebed81f
Update default neuron config for speculation (#18274)
elaineyz May 22, 2025
fa72f9a
Order sequence ids + config update to support specifying custom quant…
elaineyz May 22, 2025
f6037d1
[Bugfix] Fix MRoPE Errors in the Qwen-VL Model When Processing Pure T…
wulipc May 22, 2025
a35a494
[Bugfix] Add kwargs to RequestOutput __init__ to be forward compatibl…
lk-chen May 22, 2025
ca86a7c
[CI/Build] Update bamba test model location (#18544)
hmellor May 22, 2025
7107502
[Doc] Support --stream arg in openai_completion_client.py script (#18…
googs1025 May 22, 2025
4e04ece
[Bugfix] Use random hidden states in dummy sampler run (#18543)
abmfy May 22, 2025
3f50523
[Doc] Add stream flag for chat completion example (#18524)
calvin0327 May 22, 2025
93f7167
[BugFix][CPU] Fix x86 SHM distributed module initialization (#18536)
bigPYJ1151 May 22, 2025
cb506ec
[Misc] improve Automatic Prefix Caching example (#18554)
reidliu41 May 22, 2025
54631f8
[Misc] Call `ndarray.tobytes()` directly instead of `ndarray.data.tob…
lgeiger May 22, 2025
1f3a120
[Bugfix] make `test_openai_schema.py` pass (#18224)
davidxia May 22, 2025
721fb9b
[Platform] Move platform check to right place (#18470)
wangxiyuan May 22, 2025
f8d2cc5
[Compile][Platform] Make PiecewiseBackend pluggable and extendable (#…
MengqingCao May 22, 2025
6e588da
[Build/CI] Fix CUDA 11.8 build (#17679)
tlrmchlsmth May 22, 2025
7b9d832
[Tool] Add NIXL installation script (#18172)
lk-chen May 22, 2025
a04720b
[V1][Spec Decode][Bugfix] Load quantize weights for EAGLE (#18290)
ekagra-ranjan May 22, 2025
c91fe7b
[Frontend][Bug Fix] Update llama4 pythonic jinja template and llama4_…
wukaixingxp May 22, 2025
c32e249
[Frontend] [Core] Add Tensorizer support for V1, LoRA adapter seriali…
sangstar May 23, 2025
46791e1
[AMD] [P/D] Compute num gpus for ROCm correctly in run_accuracy_test.…
rasmith May 23, 2025
04eb88d
Re-submit: Fix: Proper RGBA -> RGB conversion for PIL images. (#18569)
huachenheli May 23, 2025
c6b636f
[V1][Spec Decoding] Use model_loader.get_model() to load models (#18273)
markmc May 23, 2025
4b0da7b
Enable hybrid attention models for Transformers backend (#18494)
hmellor May 23, 2025
fae453f
[Misc] refactor: simplify input validation and num_requests handling …
googs1025 May 23, 2025
93ecb81
[BugFix] Increase TP execute_model timeout (#18558)
njhill May 23, 2025
e44d8ce
[Bugfix] Set `KVTransferConfig.engine_id` in post_init (#18576)
lk-chen May 23, 2025
583507d
[Spec Decode] Make EAGLE3 draft token ID mapping optional (#18488)
benchislett May 23, 2025
ed5d408
[Neuron] Remove bypass on EAGLEConfig and add a test (#18514)
elaineyz May 23, 2025
4be2255
[Bugfix][Benchmarks] Fix a benchmark of deepspeed-mii backend to use …
tishizaki May 23, 2025
9c1baa5
[Misc] Replace `cuda` hard code with `current_platform` (#16983)
shen-shanshan May 23, 2025
60cad94
[Hardware] correct method signatures for HPU,ROCm,XPU (#18551)
andyxning May 23, 2025
4c61134
[V1] [Bugfix] eagle bugfix and enable correct lm_head for multimodal …
RonaldBXu May 23, 2025
71ea614
[Feature]Add async tensor parallelism using compilation pass (#17882)
cascade812 May 23, 2025
54af915
[Doc] Update quickstart and install for cu128 using `--torch-backend=…
mgoin May 23, 2025
b046cf7
[Feature][V1]: supports cached_tokens in response usage (#18149)
chaunceyjiang May 23, 2025
d0bc2f8
[Bugfix] Add half type support in reshape_and_cache_cpu_impl on x86 c…
zzzyq May 23, 2025
a1fe24d
Migrate docs from Sphinx to MkDocs (#18145)
hmellor May 23, 2025
fbb13a2
Revert "[V1] [Bugfix] eagle bugfix and enable correct lm_head for mul…
DarkLight1337 May 23, 2025
4ce64e2
[Bugfix][Model] Fix baichuan model loader for tp (#18597)
MengqingCao May 23, 2025
e493e48
[V0][Bugfix] Fix parallel sampling performance regression when guided…
shadeMe May 23, 2025
6526e05
Add myself as docs code owner (#18605)
hmellor May 23, 2025
7ab056c
[Hardware][CPU] Update intel_extension_for_pytorch 2.7.0 and move to …
yankay May 23, 2025
cd821ea
[CI] fix kv_cache_type argument (#18594)
andyxning May 23, 2025
38a95cb
[Doc] Fix indent of contributing to vllm (#18611)
Zerohertz May 23, 2025
2edb533
Replace `{func}` with mkdocs style links (#18610)
hmellor May 23, 2025
6dd51c7
[CI/Build] Fix V1 flag being set in entrypoints tests (#18598)
DarkLight1337 May 23, 2025
52fb23f
Fix examples with code blocks in docs (#18609)
hmellor May 23, 2025
6220f3c
[Bugfix] Fix transformers model impl ignored for mixtral quant (#18602)
tristanleclercq May 23, 2025
d4c2919
Include private attributes in API documentation (#18614)
hmellor May 23, 2025
2cd1fa4
[Misc] add Haystack integration (#18601)
reidliu41 May 23, 2025
1068556
[Bugfix][Build/CI] Fixup CUDA compiler version check for CUDA_SUPPORT…
simon-mo May 23, 2025
5221815
[Doc] Fix markdown list indentation for MkDocs rendering (#18620)
Zerohertz May 23, 2025
022d8ab
[Doc] Use a different color for the announcement (#18616)
DarkLight1337 May 23, 2025
6a7988c
Refactor pplx init logic to make it modular (prepare for deepep) (#18…
youkaichao May 23, 2025
3d28ad3
Fix figures in design doc (#18612)
hmellor May 23, 2025
9520a98
[Docs] Change mkdocs to not use directory urls (#18622)
mgoin May 23, 2025
6550114
[v1] Redo "Support multiple KV cache groups in GPU model runner (#179…
heheda12345 May 23, 2025
8ddd1cf
[Doc] fix list formatting (#18624)
davidxia May 23, 2025
273cb3b
[Doc] Fix top-level API links/docs (#18621)
DarkLight1337 May 23, 2025
15b45ff
[Doc] Avoid documenting dynamic / internal modules (#18626)
DarkLight1337 May 23, 2025
371f7e4
[Doc] Fix broken links and unlinked docs, add shortcuts to home sideb…
DarkLight1337 May 23, 2025
2628a69
[V1] Support Deepseek MTP (#18435)
YaoJiayi May 23, 2025
1645b60
Use prebuilt FlashInfer x86_64 PyTorch 2.7 CUDA 12.8 wheel for CI (#1…
huydhn May 23, 2025
0ddf88e
[CI] Enable test_initialization to run on V1 (#16736)
mgoin May 23, 2025
7d92164
[Doc] Update references to doc files (#18637)
DarkLight1337 May 23, 2025
f203673
[ModelOpt] Introduce VLLM_MAX_TOKENS_PER_EXPERT_FP4_MOE env var to co…
pavanimajety May 23, 2025
4fc1bf8
[Bugfix] Migrate to REGEX Library to prevent catastrophic backtrackin…
Crucifixion-Fxl May 23, 2025
2b10ba7
[Bugfix][Nixl] Fix Preemption Bug (#18631)
robertgshaw2-redhat May 23, 2025
45ab403
config.py: Clarify that only local GGUF checkpoints are supported. (#…
MathieuBordere May 24, 2025
ec82c3e
FIX MOE issue in AutoRound format (#18586)
wenhuach21 May 24, 2025
d55e446
[V1][Spec Decode] Small refactors to improve eagle bookkeeping perfor…
zixi-qi May 24, 2025
441dc63
[Frontend] improve vllm serve --help display (#18643)
reidliu41 May 24, 2025
a859320
[Model] Add support for Qwen2.5-Omni-7B-AWQ (Qwen2_5OmniForConditiona…
Nalkey May 24, 2025
c1e4a40
[V1][Spec Decode] Support multi-layer eagle draft model (#18030)
zixi-qi May 24, 2025
07458a5
[Doc] Update README links, mark external links (#18635)
DarkLight1337 May 24, 2025
e77dc4b
[MISC][pre-commit] Add pre-commit check for triton import (#17716)
MengqingCao May 24, 2025
ef1dd68
[Doc] Fix indentation problems in V0 Paged Attention docs (#18659)
DarkLight1337 May 24, 2025
6d166a8
[Doc] Add community links (#18657)
DarkLight1337 May 24, 2025
2cd4d58
[Model] use AutoWeightsLoader for gpt2 (#18625)
ztang2370 May 24, 2025
1cb194a
[Doc] Reorganize user guide (#18661)
DarkLight1337 May 24, 2025
2e67057
[CI/Build] `chmod +x` to `cleanup_pr_body.sh` (#18650)
DarkLight1337 May 24, 2025
4ceafb6
[MISC] typo fix and clean import (#18664)
andyxning May 24, 2025
b9018a3
[BugFix] Fix import error for fused_moe (#18642)
wangxiyuan May 24, 2025
2807271
[CI] enforce import regex instead of re (#18665)
aarnphm May 24, 2025
9ea7f1a
fix(regression): clone from reference items (#18662)
aarnphm May 24, 2025
b554ab7
[CI/Build] fix permission denied issue (#18645)
reidliu41 May 24, 2025
6825d9a
[BugFix][Spec Decode] Improve Prefix Caching Logic in Speculative Dec…
WoosukKwon May 25, 2025
7891fdf
[V1] Fix _pickle.PicklingError: Can't pickle <class 'transformers_mod…
eicherseiji May 25, 2025
6c6dcd8
[MISC] correct signature for LoaderFunction (#18670)
andyxning May 25, 2025
cebc22f
[Misc]Replace `cuda` hard code with `current_platform` in Ray (#14668)
noemotiovon May 25, 2025
6ab681b
[Misc][ModelScope] Change to use runtime VLLM_USE_MODELSCOPE (#18655)
MengqingCao May 25, 2025
75f8175
[VLM] Initialize video input support for InternVL models (#18499)
Isotr0py May 25, 2025
6393454
Speed up the `kernels/quantization/` tests (#18669)
mgoin May 25, 2025
44073a7
[BUGFIX] catch subclass first for try...except (#18672)
andyxning May 25, 2025
503f848
[Misc] Reduce logs on startup (#18649)
DarkLight1337 May 25, 2025
624b77a
[doc] fix broken links (#18671)
reidliu41 May 25, 2025
279f854
[doc] improve readability (#18675)
reidliu41 May 25, 2025
f2faac7
[Bugfix] Fix cpu usage and cache hit stats reporting on cpu environme…
zzzyq May 25, 2025
35be8fa
[CI/build] fix no regex (#18676)
reidliu41 May 25, 2025
3a886bd
[Misc] small improve (#18680)
reidliu41 May 25, 2025
57fd13a
[Bugfix] Fix profiling dummy data for Pixtral (#18677)
DarkLight1337 May 25, 2025
6071e98
[Core][Multimodal] Convert PIL Image to array without data copy when …
lgeiger May 25, 2025
fba0642
[CI/Build][Doc] Update `gte-Qwen2-1.5B-instruct` usage (#18683)
DarkLight1337 May 26, 2025
8820821
[Misc] Fixed the abnormally high TTFT issue in the PD disaggregation …
zhaohaidao May 26, 2025
abd4030
refactor: simplify request handler, use positive condition check for …
googs1025 May 26, 2025
561b77a
[Bugfix] Fix the lm_head in gpt_bigcode in lora mode (#6357)
maxdebayser May 26, 2025
4ea62c0
[CI] add missing argument (#18694)
andyxning May 26, 2025
4b7740a
[GH] Add issue template for reporting CI failures (#18696)
DarkLight1337 May 26, 2025
65523a0
[Doc] Fix issue template format (#18699)
DarkLight1337 May 26, 2025
61a45e7
[Bugfix] Fix Mistral-format models with sliding window (#18693)
DarkLight1337 May 26, 2025
38b13df
[CI/Build] Replace `math.isclose` with `pytest.approx` (#18703)
DarkLight1337 May 26, 2025
5a2c76c
[CI] fix dump_input for str type (#18697)
andyxning May 26, 2025
6d68030
[Model] Add support for YARN in NemotronNAS models (#18427)
Naveassaf May 26, 2025
0877750
[CI/Build] Split pooling and generation extended language models test…
Isotr0py May 26, 2025
e76be06
[Hardware][Intel-Gaudi] [CI/Build] Add tensor parallel size = 2 test …
ldurejko May 26, 2025
0665e29
[Misc] add AutoGen integration (#18712)
reidliu41 May 26, 2025
243eb91
[Bugfix]: handle hf-xet CAS error when loading Qwen3 weights in vLLM …
YanWuHao May 26, 2025
9553fdb
[Doc] Improve API docs (#18713)
DarkLight1337 May 26, 2025
82e2339
[Doc] Move examples and further reorganize user guide (#18666)
DarkLight1337 May 26, 2025
a869bac
[Bugfix] Fix Llama GGUF initialization (#18717)
DarkLight1337 May 26, 2025
e7523c2
[V1][Sampler] Improve performance of FlashInfer sampling by sampling …
lgeiger May 26, 2025
27bebcd
Convert `examples` to `ruff-format` (#18400)
hmellor May 26, 2025
0eebd74
[Model][Gemma3] Simplify image input validation (#18710)
lgeiger May 27, 2025
1f88dbd
[Misc] improve web section group title display (#18684)
reidliu41 May 27, 2025
1f1b1bc
[V1][Quantization] Add CUDA graph compatible v1 GGUF support (#18646)
Isotr0py May 27, 2025
b50602d
[Model][Gemma3] Cast image pixel values already on CPU (#18732)
lgeiger May 27, 2025
d260f79
[FEAT] [ROCm] Upgrade AITER Fused MoE kernels. (#18271)
vllmellm May 27, 2025
25a817f
[Doc] Update OOT model docs (#18742)
DarkLight1337 May 27, 2025
753944f
[Doc] Update reproducibility doc and example (#18741)
DarkLight1337 May 27, 2025
fc6d0c2
[Misc] improve docs (#18734)
reidliu41 May 27, 2025
a547aeb
feat(rocm-support): support mamba2 on rocm (#18565)
almersawi May 27, 2025
bbd9a84
[Hardware][Intel-Gaudi] [CI/Build] Fix multiple containers using the …
ldurejko May 27, 2025
4693a34
[Doc] cleanup deprecated flag for doc (#18715)
calvin0327 May 27, 2025
c24b157
Minor fix about MooncakeStoreConnector (#18721)
maobaolong May 27, 2025
e0f0ff8
[Build] fix cpu build missing libtbbmalloc.so (#18744)
kebe7jun May 27, 2025
6881107
[BUG FIX] minicpm (#18739)
huangyuxiang03 May 27, 2025
a68e293
[Doc] Convert Sphinx directives ( `{class}`, `{meth}`, `{attr}`, ...…
Zerohertz May 27, 2025
4318c05
[CI/Build] Remove imports of built-in `re` (#18750)
DarkLight1337 May 27, 2025
06a0338
[V1][Metrics] Add API for accessing in-memory Prometheus metrics (#17…
markmc May 27, 2025
aaa4ac1
Disable prefix cache by default for benchmark (#18639)
cascade812 May 27, 2025
6b6d496
optimize get_kv_cache_torch_dtype (#18531)
chunxiaozheng May 27, 2025
696259c
[Core] Automatically cast multi-modal input dtype (#18756)
DarkLight1337 May 27, 2025
5873877
[Bugfix] Mistral tool calling when content is list (#18729)
mgoin May 27, 2025
20 changes: 12 additions & 8 deletions .buildkite/check-wheel-size.py
@@ -8,12 +8,12 @@
# Note that we have 400 MiB quota, please use it wisely.
# See https://github.com/pypi/support/issues/3792 .
# Please also sync the value with the one in Dockerfile.
VLLM_MAX_SIZE_MB = int(os.environ.get('VLLM_MAX_SIZE_MB', 400))
VLLM_MAX_SIZE_MB = int(os.environ.get("VLLM_MAX_SIZE_MB", 400))


def print_top_10_largest_files(zip_file):
"""Print the top 10 largest files in the given zip file."""
with zipfile.ZipFile(zip_file, 'r') as z:
with zipfile.ZipFile(zip_file, "r") as z:
file_sizes = [(f, z.getinfo(f).file_size) for f in z.namelist()]
file_sizes.sort(key=lambda x: x[1], reverse=True)
for f, size in file_sizes[:10]:
@@ -28,14 +28,18 @@ def check_wheel_size(directory):
wheel_path = os.path.join(root, file_name)
wheel_size_mb = os.path.getsize(wheel_path) / (1024 * 1024)
if wheel_size_mb > VLLM_MAX_SIZE_MB:
print(f"Not allowed: Wheel {wheel_path} is larger "
f"({wheel_size_mb:.2f} MB) than the limit "
f"({VLLM_MAX_SIZE_MB} MB).")
print(
f"Not allowed: Wheel {wheel_path} is larger "
f"({wheel_size_mb:.2f} MB) than the limit "
f"({VLLM_MAX_SIZE_MB} MB)."
)
print_top_10_largest_files(wheel_path)
return 1
else:
print(f"Wheel {wheel_path} is within the allowed size "
f"({wheel_size_mb:.2f} MB).")
print(
f"Wheel {wheel_path} is within the allowed size "
f"({wheel_size_mb:.2f} MB)."
)
return 0


@@ -45,4 +49,4 @@ def check_wheel_size(directory):
sys.exit(1)

directory = sys.argv[1]
sys.exit(check_wheel_size(directory))
sys.exit(check_wheel_size(directory))
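For context, the size limit in the hunk above is environment-overridable. A minimal sketch of exercising the same logic locally (the wheel path and override value are made-up assumptions, not from this PR):

import os

# Mirrors the diff above: the limit comes from VLLM_MAX_SIZE_MB, defaulting
# to the 400 MiB PyPI quota noted in the script's comments.
os.environ["VLLM_MAX_SIZE_MB"] = "450"  # hypothetical local override
limit_mb = int(os.environ.get("VLLM_MAX_SIZE_MB", 400))

wheel_path = "dist/vllm-0.0.0-py3-none-any.whl"  # placeholder path
wheel_size_mb = os.path.getsize(wheel_path) / (1024 * 1024)
print(f"{wheel_size_mb:.2f} MB against a {limit_mb} MB limit")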
4 changes: 2 additions & 2 deletions .buildkite/generate_index.py
@@ -22,5 +22,5 @@
print(f"Generated index.html for {args.wheel}")
# cloudfront requires escaping the '+' character
f.write(
template.format(wheel=filename,
wheel_html_escaped=filename.replace("+", "%2B")))
template.format(wheel=filename, wheel_html_escaped=filename.replace("+", "%2B"))
)
11 changes: 11 additions & 0 deletions .buildkite/lm-eval-harness/configs/Llama-3.2-1B-Instruct-FP8.yaml
@@ -0,0 +1,11 @@
# bash .buildkite/lm-eval-harness/run-lm-eval-gsm-vllm-baseline.sh -m RedHatAI/Llama-3.2-1B-Instruct-FP8 -b "auto" -l 1319 -f 5 -t 1
model_name: "RedHatAI/Llama-3.2-1B-Instruct-FP8"
tasks:
- name: "gsm8k"
metrics:
- name: "exact_match,strict-match"
value: 0.335
- name: "exact_match,flexible-extract"
value: 0.323
limit: 1319
num_fewshot: 5
11 changes: 11 additions & 0 deletions .buildkite/lm-eval-harness/configs/Qwen2.5-1.5B-Instruct.yaml
@@ -0,0 +1,11 @@
# bash .buildkite/lm-eval-harness/run-lm-eval-gsm-vllm-baseline.sh -m Qwen/Qwen2.5-1.5B-Instruct -b auto -l 1319 -f 5 -t 1
model_name: "Qwen/Qwen2.5-1.5B-Instruct"
tasks:
- name: "gsm8k"
metrics:
- name: "exact_match,strict-match"
value: 0.54
- name: "exact_match,flexible-extract"
value: 0.59
limit: 1319
num_fewshot: 5
11 changes: 11 additions & 0 deletions .buildkite/lm-eval-harness/configs/Qwen2.5-VL-3B-Instruct-FP8-dynamic.yaml
@@ -0,0 +1,11 @@
# bash .buildkite/lm-eval-harness/run-lm-eval-gsm-vllm-baseline.sh -m RedHatAI/Qwen2.5-VL-3B-Instruct-FP8-Dynamic -b auto -l 1319 -f 5 -t 1
model_name: "RedHatAI/Qwen2.5-VL-3B-Instruct-FP8-Dynamic"
tasks:
- name: "gsm8k"
metrics:
- name: "exact_match,strict-match"
value: 0.47
- name: "exact_match,flexible-extract"
value: 0.64
limit: 1319
num_fewshot: 5
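The three new configs above share one schema: a model name, a gsm8k task, and baseline accuracy values that the harness checks against. A sketch of reading one (the path reuses the Qwen2.5-1.5B-Instruct.yaml file added above; PyYAML is assumed available):

import yaml

with open(".buildkite/lm-eval-harness/configs/Qwen2.5-1.5B-Instruct.yaml") as f:
    cfg = yaml.safe_load(f)

print(cfg["model_name"])  # Qwen/Qwen2.5-1.5B-Instruct
for task in cfg["tasks"]:
    for metric in task["metrics"]:
        print(task["name"], metric["name"], metric["value"])
# gsm8k exact_match,strict-match 0.54
# gsm8k exact_match,flexible-extract 0.59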
1 change: 1 addition & 0 deletions .buildkite/lm-eval-harness/configs/models-large.txt
@@ -3,3 +3,4 @@ Meta-Llama-3-70B-Instruct.yaml
Mixtral-8x7B-Instruct-v0.1.yaml
Qwen2-57B-A14-Instruct.yaml
DeepSeek-V2-Lite-Chat.yaml
Meta-Llama-3-8B-QQQ.yaml
8 changes: 2 additions & 6 deletions .buildkite/lm-eval-harness/configs/models-small.txt
@@ -1,10 +1,6 @@
Meta-Llama-3-8B-Instruct.yaml
Meta-Llama-3-8B-Instruct-FP8-compressed-tensors.yaml
Qwen2.5-1.5B-Instruct.yaml
Meta-Llama-3.2-1B-Instruct-INT8-compressed-tensors.yaml
Meta-Llama-3-8B-Instruct-INT8-compressed-tensors-asym.yaml
Meta-Llama-3-8B-Instruct-nonuniform-compressed-tensors.yaml
Meta-Llama-3-8B-Instruct-Channelwise-compressed-tensors.yaml
Qwen2.5-VL-3B-Instruct-FP8-dynamic.yaml
Qwen1.5-MoE-W4A16-compressed-tensors.yaml
Qwen2-1.5B-Instruct-INT8-compressed-tensors.yaml
Qwen2-1.5B-Instruct-FP8W8.yaml
Meta-Llama-3-8B-QQQ.yaml
16 changes: 10 additions & 6 deletions .buildkite/lm-eval-harness/conftest.py
@@ -8,11 +8,14 @@ def pytest_addoption(parser):
parser.addoption(
"--config-list-file",
action="store",
help="Path to the file listing model config YAMLs (one per line)")
parser.addoption("--tp-size",
action="store",
default="1",
help="Tensor parallel size to use for evaluation")
help="Path to the file listing model config YAMLs (one per line)",
)
parser.addoption(
"--tp-size",
action="store",
default="1",
help="Tensor parallel size to use for evaluation",
)


@pytest.fixture(scope="session")
@@ -33,7 +36,8 @@ def pytest_generate_tests(metafunc):
config_dir = config_list_file.parent
with open(config_list_file, encoding="utf-8") as f:
configs = [
config_dir / line.strip() for line in f
config_dir / line.strip()
for line in f
if line.strip() and not line.startswith("#")
]
metafunc.parametrize("config_filename", configs)
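The two options defined above are read once per session; pytest_generate_tests then parametrizes the test over every YAML listed in the config list file, resolving paths relative to that file's directory. A sketch of a programmatic invocation (file locations are assumptions based on the paths in this PR):

import pytest

# Equivalent to running from .buildkite/lm-eval-harness/:
#   pytest test_lm_eval_correctness.py --config-list-file=configs/models-small.txt --tp-size=1
pytest.main(
    [
        ".buildkite/lm-eval-harness/test_lm_eval_correctness.py",
        "--config-list-file=.buildkite/lm-eval-harness/configs/models-small.txt",
        "--tp-size=1",
    ]
)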
26 changes: 15 additions & 11 deletions .buildkite/lm-eval-harness/test_lm_eval_correctness.py
@@ -16,19 +16,22 @@


def launch_lm_eval(eval_config, tp_size):
trust_remote_code = eval_config.get('trust_remote_code', False)
model_args = f"pretrained={eval_config['model_name']}," \
f"tensor_parallel_size={tp_size}," \
f"enforce_eager=true," \
f"add_bos_token=true," \
f"trust_remote_code={trust_remote_code}"
trust_remote_code = eval_config.get("trust_remote_code", False)
model_args = (
f"pretrained={eval_config['model_name']},"
f"tensor_parallel_size={tp_size},"
f"enforce_eager=true,"
f"add_bos_token=true,"
f"trust_remote_code={trust_remote_code}"
)
results = lm_eval.simple_evaluate(
model="vllm",
model_args=model_args,
tasks=[task["name"] for task in eval_config["tasks"]],
num_fewshot=eval_config["num_fewshot"],
limit=eval_config["limit"],
batch_size="auto")
batch_size="auto",
)
return results


@@ -42,9 +45,10 @@ def test_lm_eval_correctness_param(config_filename, tp_size):
for metric in task["metrics"]:
ground_truth = metric["value"]
measured_value = results["results"][task["name"]][metric["name"]]
print(f'{task["name"]} | {metric["name"]}: '
f'ground_truth={ground_truth} | measured={measured_value}')
success = success and np.isclose(
ground_truth, measured_value, rtol=RTOL)
print(
f"{task['name']} | {metric['name']}: "
f"ground_truth={ground_truth} | measured={measured_value}"
)
success = success and np.isclose(ground_truth, measured_value, rtol=RTOL)

assert success
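The assertion above hinges on np.isclose with a relative tolerance. A quick illustration of its semantics, using the argument order from the test (the RTOL value here is an assumption for illustration; the real constant is defined earlier in the file):

import numpy as np

RTOL = 0.05  # assumed value

# np.isclose(ground_truth, measured, rtol=RTOL) passes when
# |ground_truth - measured| <= atol + RTOL * |measured| (atol defaults to 1e-8).
print(np.isclose(0.54, 0.52, rtol=RTOL))  # True: within ~5% of the baseline
print(np.isclose(0.54, 0.40, rtol=RTOL))  # False: a genuine regression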
.buildkite/nightly-benchmarks/scripts/convert-results-json-to-markdown.py
@@ -65,18 +65,18 @@ def read_markdown(file):


def results_to_json(latency, throughput, serving):
return json.dumps({
'latency': latency.to_dict(),
'throughput': throughput.to_dict(),
'serving': serving.to_dict()
})
return json.dumps(
{
"latency": latency.to_dict(),
"throughput": throughput.to_dict(),
"serving": serving.to_dict(),
}
)


if __name__ == "__main__":

# collect results
for test_file in results_folder.glob("*.json"):

with open(test_file) as f:
raw_result = json.loads(f.read())

@@ -120,7 +120,8 @@ def results_to_json(latency, throughput, serving):
for perc in [10, 25, 50, 75, 90, 99]:
# Multiply 1000 to convert the time unit from s to ms
raw_result.update(
{f"P{perc}": 1000 * raw_result["percentiles"][str(perc)]})
{f"P{perc}": 1000 * raw_result["percentiles"][str(perc)]}
)
raw_result["avg_latency"] = raw_result["avg_latency"] * 1000

# add the result to raw_result
@@ -153,26 +154,27 @@ def results_to_json(latency, throughput, serving):
serving_results = pd.DataFrame.from_dict(serving_results)
throughput_results = pd.DataFrame.from_dict(throughput_results)

raw_results_json = results_to_json(latency_results, throughput_results,
serving_results)
raw_results_json = results_to_json(
latency_results, throughput_results, serving_results
)

# remapping the key, for visualization purpose
if not latency_results.empty:
latency_results = latency_results[list(
latency_column_mapping.keys())].rename(
columns=latency_column_mapping)
latency_results = latency_results[list(latency_column_mapping.keys())].rename(
columns=latency_column_mapping
)
if not serving_results.empty:
serving_results = serving_results[list(
serving_column_mapping.keys())].rename(
columns=serving_column_mapping)
serving_results = serving_results[list(serving_column_mapping.keys())].rename(
columns=serving_column_mapping
)
if not throughput_results.empty:
throughput_results = throughput_results[list(
throughput_results_column_mapping.keys())].rename(
columns=throughput_results_column_mapping)
throughput_results = throughput_results[
list(throughput_results_column_mapping.keys())
].rename(columns=throughput_results_column_mapping)

processed_results_json = results_to_json(latency_results,
throughput_results,
serving_results)
processed_results_json = results_to_json(
latency_results, throughput_results, serving_results
)

for df in [latency_results, serving_results, throughput_results]:
if df.empty:
@@ -184,38 +186,39 @@ def results_to_json(latency, throughput, serving):
# The GPUs sometimes come in format of "GPUTYPE\nGPUTYPE\n...",
# we want to turn it into "8xGPUTYPE"
df["GPU"] = df["GPU"].apply(
lambda x: f"{len(x.split('\n'))}x{x.split('\n')[0]}")
lambda x: f"{len(x.split('\n'))}x{x.split('\n')[0]}"
)

# get markdown tables
latency_md_table = tabulate(latency_results,
headers='keys',
tablefmt='pipe',
showindex=False)
serving_md_table = tabulate(serving_results,
headers='keys',
tablefmt='pipe',
showindex=False)
throughput_md_table = tabulate(throughput_results,
headers='keys',
tablefmt='pipe',
showindex=False)
latency_md_table = tabulate(
latency_results, headers="keys", tablefmt="pipe", showindex=False
)
serving_md_table = tabulate(
serving_results, headers="keys", tablefmt="pipe", showindex=False
)
throughput_md_table = tabulate(
throughput_results, headers="keys", tablefmt="pipe", showindex=False
)

# document the result
with open(results_folder / "benchmark_results.md", "w") as f:

results = read_markdown("../.buildkite/nightly-benchmarks/" +
"performance-benchmarks-descriptions.md")
results = read_markdown(
"../.buildkite/nightly-benchmarks/"
+ "performance-benchmarks-descriptions.md"
)
results = results.format(
latency_tests_markdown_table=latency_md_table,
throughput_tests_markdown_table=throughput_md_table,
serving_tests_markdown_table=serving_md_table,
benchmarking_results_in_json_string=processed_results_json)
benchmarking_results_in_json_string=processed_results_json,
)
f.write(results)

# document benchmarking results in json
with open(results_folder / "benchmark_results.json", "w") as f:

results = latency_results.to_dict(
orient='records') + throughput_results.to_dict(
orient='records') + serving_results.to_dict(orient='records')
results = (
latency_results.to_dict(orient="records")
+ throughput_results.to_dict(orient="records")
+ serving_results.to_dict(orient="records")
)
f.write(json.dumps(results))
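One line in the hunk above is easy to misread: the GPU column arrives as newline-joined repeats of the device name and is collapsed into a count prefix. A tiny standalone check (the sample value is made up):

# Reproduces the normalization done by the lambda above.
gpu_cell = "H100\nH100\nH100\nH100\nH100\nH100\nH100\nH100"
parts = gpu_cell.split("\n")
print(f"{len(parts)}x{parts[0]}")  # 8xH100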
15 changes: 6 additions & 9 deletions .buildkite/nightly-benchmarks/scripts/download-tokenizer.py
@@ -14,15 +14,12 @@ def main(model, cachedir):

if __name__ == "__main__":
parser = argparse.ArgumentParser(
description="Download and save Hugging Face tokenizer")
parser.add_argument("--model",
type=str,
required=True,
help="Name of the model")
parser.add_argument("--cachedir",
type=str,
required=True,
help="Directory to save the tokenizer")
description="Download and save Hugging Face tokenizer"
)
parser.add_argument("--model", type=str, required=True, help="Name of the model")
parser.add_argument(
"--cachedir", type=str, required=True, help="Directory to save the tokenizer"
)

args = parser.parse_args()
main(args.model, args.cachedir)
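For reference, a sketch of how this script would be used and what its main() presumably does (the model name and cache directory are placeholders, and the AutoTokenizer round-trip is an assumption, since main()'s body sits outside this hunk):

# Command line (hypothetical values):
#   python download-tokenizer.py --model facebook/opt-125m --cachedir /tmp/tokenizers
from transformers import AutoTokenizer

def main(model: str, cachedir: str) -> None:
    # Fetch the tokenizer from the Hugging Face Hub and save it locally
    # so benchmark runs don't need network access.
    tokenizer = AutoTokenizer.from_pretrained(model)
    tokenizer.save_pretrained(cachedir)

main("facebook/opt-125m", "/tmp/tokenizers")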