This repository was archived by the owner on Sep 4, 2025. It is now read-only.

Merged

Changes from all commits (420 commits)
775f00f
[Speculative Decoding] Test refactor (#8317)
LiuXiaoxuanPKU Sep 11, 2024
d394787
Pixtral (#8377)
patrickvonplaten Sep 11, 2024
3fd2b0d
Bump version to v0.6.1 (#8379)
simon-mo Sep 11, 2024
a65cb16
[MISC] Dump model runner inputs when crashing (#8305)
comaniac Sep 12, 2024
f842a7a
[misc] remove engine_use_ray (#8126)
youkaichao Sep 12, 2024
b71c956
[TPU] Use Ray for default distributed backend (#8389)
WoosukKwon Sep 12, 2024
b6c75e1
Fix the AMD weight loading tests (#8390)
mgoin Sep 12, 2024
5a60699
[Bugfix]: Fix the logic for deciding if tool parsing is used (#8366)
tomeras91 Sep 12, 2024
1bf2dd9
[Gemma2] add bitsandbytes support for Gemma2 (#8338)
blueyo0 Sep 12, 2024
295c473
[Misc] Raise error when using encoder/decoder model with cpu backend …
kevin314 Sep 12, 2024
42ffba1
[Misc] Use RoPE cache for MRoPE (#8396)
WoosukKwon Sep 12, 2024
7de49aa
[torch.compile] hide slicing under custom op for inductor (#8384)
youkaichao Sep 12, 2024
520ca38
[Hotfix][VLM] Fixing max position embeddings for Pixtral (#8399)
ywang96 Sep 12, 2024
e56bf27
[Bugfix] Fix InternVL2 inference with various num_patches (#8375)
Isotr0py Sep 12, 2024
c6202da
[Model] Support multiple images for qwen-vl (#8247)
alex-jw-brooks Sep 12, 2024
8a23e93
[BugFix] lazy init _copy_stream to avoid torch init wrong gpu instanc…
lnykww Sep 12, 2024
1f0c75a
[BugFix] Fix Duplicate Assignment in Hermes2ProToolParser (#8423)
vegaluisjose Sep 12, 2024
f2e263b
[Bugfix] Offline mode fix (#8376)
joerunde Sep 12, 2024
a6c0f36
[multi-step] add flashinfer backend (#7928)
SolitaryThinker Sep 12, 2024
551ce01
[Core] Add engine option to return only deltas or final output (#7381)
njhill Sep 12, 2024
0198772
[Bugfix] multi-step + flashinfer: ensure cuda graph compatible (#8427)
alexm-redhat Sep 12, 2024
c163694
[Hotfix][Core][VLM] Disable chunked prefill by default and prefix cac…
ywang96 Sep 12, 2024
b61bd98
[CI/Build] Disable multi-node test for InternVL2 (#8428)
ywang96 Sep 12, 2024
d31174a
[Hotfix][Pixtral] Fix multiple images bugs (#8415)
patrickvonplaten Sep 12, 2024
a480939
[Bugfix] Fix weight loading issue by rename variable. (#8293)
wenxcs Sep 12, 2024
360ddbd
[Misc] Update Pixtral example (#8431)
ywang96 Sep 13, 2024
8f44a92
[BugFix] fix group_topk (#8430)
dsikka Sep 13, 2024
5ec9c0f
[Core] Factor out input preprocessing to a separate class (#7329)
DarkLight1337 Sep 13, 2024
40c3965
[Bugfix] Mapping physical device indices for e2e test utils (#8290)
ShangmingCai Sep 13, 2024
3f79bc3
[Bugfix] Bump fastapi and pydantic version (#8435)
DarkLight1337 Sep 13, 2024
8427550
[CI/Build] Update pixtral tests to use JSON (#8436)
DarkLight1337 Sep 13, 2024
6821020
[Bugfix] Fix async log stats (#8417)
alexm-redhat Sep 13, 2024
ba77527
[bugfix] torch profiler bug for single gpu with GPUExecutor (#8354)
SolitaryThinker Sep 13, 2024
acda0b3
bump version to v0.6.1.post1 (#8440)
simon-mo Sep 13, 2024
9b4a3b2
[CI/Build] Enable InternVL2 PP test only on single node (#8437)
Isotr0py Sep 13, 2024
cab69a1
[doc] recommend pip instead of conda (#8446)
youkaichao Sep 13, 2024
06311e2
[Misc] Skip loading extra bias for Qwen2-VL GPTQ-Int8 (#8442)
jeejeelee Sep 13, 2024
a246912
[misc][ci] fix quant test (#8449)
youkaichao Sep 13, 2024
ecd7a1d
[Installation] Gate FastAPI version for Python 3.8 (#8456)
DarkLight1337 Sep 13, 2024
0a4806f
[plugin][torch.compile] allow to add custom compile backend (#8445)
youkaichao Sep 13, 2024
a84e598
[CI/Build] Reorganize models tests (#7820)
DarkLight1337 Sep 13, 2024
f57092c
[Doc] Add oneDNN installation to CPU backend documentation (#8467)
Isotr0py Sep 13, 2024
18e9e1f
[HotFix] Fix final output truncation with stop string + streaming (#8…
njhill Sep 13, 2024
9ba0817
bump version to v0.6.1.post2 (#8473)
simon-mo Sep 13, 2024
8517252
[Hardware][intel GPU] bump up ipex version to 2.3 (#8365)
jikunshang Sep 13, 2024
1ef0d2e
[Kernel][Hardware][Amd]Custom paged attention kernel for rocm (#8310)
charlifu Sep 14, 2024
8a0cf1d
[Model] support minicpm3 (#8297)
SUDA-HLT-ywfang Sep 14, 2024
a36e070
[torch.compile] fix functionalization (#8480)
youkaichao Sep 14, 2024
47790f3
[torch.compile] add a flag to disable custom op (#8488)
youkaichao Sep 14, 2024
50e9ec4
[TPU] Implement multi-step scheduling (#8489)
WoosukKwon Sep 14, 2024
3724d5f
[Bugfix][Model] Fix Python 3.8 compatibility in Pixtral model by upda…
chrisociepa Sep 15, 2024
fc990f9
[Bugfix][Kernel] Add `IQ1_M` quantization implementation to GGUF kern…
Isotr0py Sep 15, 2024
a091e2d
[Kernel] Enable 8-bit weights in Fused Marlin MoE (#8032)
ElizaWszola Sep 16, 2024
837c196
[Frontend] Expose revision arg in OpenAI server (#8501)
lewtun Sep 16, 2024
acd5511
[BugFix] Fix clean shutdown issues (#8492)
njhill Sep 16, 2024
781e3b9
[Bugfix][Kernel] Fix build for sm_60 in GGUF kernel (#8506)
sasha0552 Sep 16, 2024
5d73ae4
[Kernel] AQ AZP 3/4: Asymmetric quantization kernels (#7270)
ProExpertProg Sep 16, 2024
2759a43
[doc] update doc on testing and debugging (#8514)
youkaichao Sep 16, 2024
47f5e03
[Bugfix] Bind api server port before starting engine (#8491)
kevin314 Sep 16, 2024
5478c4b
[perf bench] set timeout to debug hanging (#8516)
simon-mo Sep 16, 2024
5ce45eb
[misc] small qol fixes for release process (#8517)
simon-mo Sep 16, 2024
cca6164
[Bugfix] Fix 3.12 builds on main (#8510)
joerunde Sep 17, 2024
546034b
[refactor] remove triton based sampler (#8524)
simon-mo Sep 17, 2024
1c1bb38
[Frontend] Improve Nullable kv Arg Parsing (#8525)
alex-jw-brooks Sep 17, 2024
ee2bcea
[Misc][Bugfix] Disable guided decoding for mistral tokenizer (#8521)
ywang96 Sep 17, 2024
99aa4ed
[torch.compile] register allreduce operations as custom ops (#8526)
youkaichao Sep 17, 2024
cbdb252
[Misc] Limit to ray[adag] 2.35 to avoid backward incompatible change …
ruisearch42 Sep 17, 2024
1b6de83
[Benchmark] Support sample from HF datasets and image input for bench…
Isotr0py Sep 17, 2024
1009e93
[Encoder decoder] Add cuda graph support during decoding for encoder-…
sroy745 Sep 17, 2024
9855b99
[Feature][kernel] tensor parallelism with bitsandbytes quantization (…
chenqianfzh Sep 17, 2024
a54ed80
[Model] Add mistral function calling format to all models loaded with…
patrickvonplaten Sep 17, 2024
56c3de0
[Misc] Don't dump contents of kvcache tensors on errors (#8527)
njhill Sep 17, 2024
98f9713
[Bugfix] Fix TP > 1 for new granite (#8544)
joerunde Sep 17, 2024
fa0c114
[doc] improve installation doc (#8550)
youkaichao Sep 17, 2024
09deb47
[CI/Build] Excluding kernels/test_gguf.py from ROCm (#8520)
alexeykondrat Sep 17, 2024
8110e44
[Kernel] Change interface to Mamba causal_conv1d_update for continuou…
tlrmchlsmth Sep 17, 2024
95965d3
[CI/Build] fix Dockerfile.cpu on podman (#8540)
dtrifiro Sep 18, 2024
e351572
[Misc] Add argument to disable FastAPI docs (#8554)
Jeffwan Sep 18, 2024
6ffa3f3
[CI/Build] Avoid CUDA initialization (#8534)
DarkLight1337 Sep 18, 2024
9d104b5
[CI/Build] Update Ruff version (#8469)
aarnphm Sep 18, 2024
7c7714d
[Core][Bugfix][Perf] Introduce `MQLLMEngine` to avoid `asyncio` OH (#…
alexm-redhat Sep 18, 2024
a8c1d16
[Core] *Prompt* logprobs support in Multi-step (#8199)
afeldman-nm Sep 18, 2024
d65798f
[Core] zmq: bind only to 127.0.0.1 for local-only usage (#8543)
russellb Sep 18, 2024
e18749f
[Model] Support Solar Model (#8386)
shing100 Sep 18, 2024
b3195bc
[AMD][ROCm]Quantization methods on ROCm; Fix _scaled_mm call (#8380)
gshtras Sep 18, 2024
db9120c
[Kernel] Change interface to Mamba selective_state_update for continu…
tlrmchlsmth Sep 18, 2024
d9cd78e
[BugFix] Nonzero exit code if MQLLMEngine startup fails (#8572)
njhill Sep 18, 2024
0d47bf3
[Bugfix] add `dead_error` property to engine client (#8574)
joerunde Sep 18, 2024
4c34ce8
[Kernel] Remove marlin moe templating on thread_m_blocks (#8573)
tlrmchlsmth Sep 19, 2024
3118f63
[Bugfix] [Encoder-Decoder] Bugfix for encoder specific metadata const…
sroy745 Sep 19, 2024
02c9afa
Revert "[Misc][Bugfix] Disable guided decoding for mistral tokenizer"…
ywang96 Sep 19, 2024
c52ec5f
[Bugfix] fixing sonnet benchmark bug in benchmark_serving.py (#8616)
KuntaiDu Sep 19, 2024
855c8ae
[MISC] remove engine_use_ray in benchmark_throughput.py (#8615)
jikunshang Sep 19, 2024
76515f3
[Frontend] Use MQLLMEngine for embeddings models too (#8584)
njhill Sep 19, 2024
9cc373f
[Kernel][Amd] Add fp8 kv cache support for rocm custom paged attentio…
charlifu Sep 19, 2024
e42c634
[Core] simplify logits resort in _apply_top_k_top_p (#8619)
hidva Sep 19, 2024
ea4647b
[Doc] Add documentation for GGUF quantization (#8618)
Isotr0py Sep 19, 2024
9e99407
Create SECURITY.md (#8642)
simon-mo Sep 19, 2024
6cb748e
[CI/Build] Re-enabling Entrypoints tests on ROCm, excluding ones that…
alexeykondrat Sep 19, 2024
de6f90a
[Misc] guard against change in cuda library name (#8609)
bnellnm Sep 19, 2024
18ae428
[Bugfix] Fix Phi3.5 mini and MoE LoRA inference (#8571)
garg-amit Sep 20, 2024
9e5ec35
[bugfix] [AMD] add multi-step advance_step to ROCmFlashAttentionMetad…
SolitaryThinker Sep 20, 2024
260d40b
[Core] Support Lora lineage and base model metadata management (#6315)
Jeffwan Sep 20, 2024
3b63de9
[Model] Add OLMoE (#7922)
Muennighoff Sep 20, 2024
2940afa
[CI/Build] Removing entrypoints/openai/test_embedding.py test from RO…
alexeykondrat Sep 20, 2024
b28298f
[Bugfix] Validate SamplingParam n is an int (#8548)
saumya-saran Sep 20, 2024
035fa89
[Misc] Show AMD GPU topology in `collect_env.py` (#8649)
DarkLight1337 Sep 20, 2024
2874bac
[Bugfix] Config got an unexpected keyword argument 'engine' (#8556)
Juelianqvq Sep 20, 2024
b4e4eda
[Bugfix][Core] Fix tekken edge case for mistral tokenizer (#8640)
patrickvonplaten Sep 20, 2024
7c8566a
[Doc] neuron documentation update (#8671)
omrishiv Sep 20, 2024
7f9c890
[Hardware][AWS] update neuron to 2.20 (#8676)
omrishiv Sep 20, 2024
0f961b3
[Bugfix] Fix incorrect llava next feature size calculation (#8496)
zyddnys Sep 20, 2024
0057894
[Core] Rename `PromptInputs` and `inputs` (#8673)
DarkLight1337 Sep 21, 2024
d4bf085
[MISC] add support custom_op check (#8557)
jikunshang Sep 21, 2024
0455c46
[Core] Factor out common code in `SequenceData` and `Sequence` (#8675)
DarkLight1337 Sep 21, 2024
0faab90
[beam search] add output for manually checking the correctness (#8684)
youkaichao Sep 21, 2024
71c6049
[Kernel] Build flash-attn from source (#8245)
ProExpertProg Sep 21, 2024
5e85f4f
[VLM] Use `SequenceData.from_token_counts` to create dummy data (#8687)
DarkLight1337 Sep 21, 2024
4dfdf43
[Doc] Fix typo in AMD installation guide (#8689)
Imss27 Sep 21, 2024
ec4aaad
[Kernel][Triton][AMD] Remove tl.atomic_add from awq_gemm_kernel, 2-5x…
rasmith Sep 21, 2024
9dc7c6c
[dbrx] refactor dbrx experts to extend FusedMoe class (#8518)
divakar-amd Sep 21, 2024
d66ac62
[Kernel][Bugfix] Delete some more useless code in marlin_moe_ops.cu (…
tlrmchlsmth Sep 21, 2024
13d88d4
[Bugfix] Refactor composite weight loading logic (#8656)
Isotr0py Sep 22, 2024
0e40ac9
[ci][build] fix vllm-flash-attn (#8699)
youkaichao Sep 22, 2024
06ed281
[Model] Refactor BLIP/BLIP-2 to support composite model loading (#8407)
DarkLight1337 Sep 22, 2024
8ca5051
[Misc] Use NamedTuple in Multi-image example (#8705)
alex-jw-brooks Sep 22, 2024
ca2b628
[MISC] rename CudaMemoryProfiler to DeviceMemoryProfiler (#8703)
ji-huazhong Sep 22, 2024
5b59532
[Model][VLM] Add LLaVA-Onevision model support (#8486)
litianjian Sep 22, 2024
c6bd70d
[SpecDec][Misc] Cleanup, remove bonus token logic. (#8701)
LiuXiaoxuanPKU Sep 22, 2024
d4a2ac8
[build] enable existing pytorch (for GH200, aarch64, nightly) (#8713)
youkaichao Sep 22, 2024
92ba7e7
[misc] upgrade mistral-common (#8715)
youkaichao Sep 22, 2024
3dda7c2
[Bugfix] Avoid some bogus messages RE CUTLASS's revision when buildin…
tlrmchlsmth Sep 23, 2024
57a0702
[Bugfix] Fix CPU CMake build (#8723)
ProExpertProg Sep 23, 2024
d23679e
[Bugfix] fix docker build for xpu (#8652)
yma11 Sep 23, 2024
9b8c8ba
[Core][Frontend] Support Passing Multimodal Processor Kwargs (#8657)
alex-jw-brooks Sep 23, 2024
e551ca1
[Hardware][CPU] Refactor CPU model runner (#8729)
Isotr0py Sep 23, 2024
3e83c12
[Bugfix][CPU] fix missing input intermediate_tensors in the cpu_model…
bigPYJ1151 Sep 23, 2024
a79e522
[Model] Support pp for qwen2-vl (#8696)
liuyanyi Sep 23, 2024
f2bd246
[VLM] Fix paligemma, fuyu and persimmon with transformers 4.45 : use …
janimo Sep 23, 2024
ee5f34b
[CI/Build] use setuptools-scm to set __version__ (#4738)
dtrifiro Sep 23, 2024
86e9c8d
[Kernel] (2/N) Machete - Integrate into CompressedTensorsWNA16 and GP…
LucasWilkinson Sep 23, 2024
9b0e3ec
[Kernel][LoRA] Add assertion for punica sgmv kernels (#7585)
jeejeelee Sep 23, 2024
b05f5c9
[Core] Allow IPv6 in VLLM_HOST_IP with zmq (#8575)
russellb Sep 23, 2024
5f7bb58
Fix typical acceptance sampler with correct recovered token ids (#8562)
jiqing-feng Sep 23, 2024
1a2aef3
Add output streaming support to multi-step + async while ensuring Req…
alexm-redhat Sep 23, 2024
530821d
[Hardware][AMD] ROCm6.2 upgrade (#8674)
hongxiayang Sep 24, 2024
88577ac
Fix tests in test_scheduler.py that fail with BlockManager V2 (#8728)
sroy745 Sep 24, 2024
0250dd6
re-implement beam search on top of vllm core (#8726)
youkaichao Sep 24, 2024
3185fb0
Revert "[Core] Rename `PromptInputs` to `PromptType`, and `inputs` to…
simon-mo Sep 24, 2024
b8747e8
[MISC] Skip dumping inputs when unpicklable (#8744)
comaniac Sep 24, 2024
3f06bae
[Core][Model] Support loading weights by ID within models (#7931)
petersalas Sep 24, 2024
8ff7ced
[Model] Expose Phi3v num_crops as a mm_processor_kwarg (#8658)
alex-jw-brooks Sep 24, 2024
cc4325b
[Bugfix] Fix potentially unsafe custom allreduce synchronization (#8558)
hanzhi713 Sep 24, 2024
a928ded
[Kernel] Split Marlin MoE kernels into multiple files (#8661)
ElizaWszola Sep 24, 2024
2529d09
[Frontend] Batch inference for llm.chat() API (#8648)
aandyw Sep 24, 2024
72fc97a
[Bugfix] Fix torch dynamo fixes caused by `replace_parameters` (#8748)
LucasWilkinson Sep 24, 2024
2467b64
[CI/Build] fix setuptools-scm usage (#8771)
dtrifiro Sep 24, 2024
1e7d5c0
[misc] soft drop beam search (#8763)
youkaichao Sep 24, 2024
13f9f7a
[Misc] Upgrade bitsandbytes to the latest version 0.44.0 (#8768)
jeejeelee Sep 25, 2024
01b6f9e
[Core][Bugfix] Support prompt_logprobs returned with speculative deco…
tjohnson31415 Sep 25, 2024
6da1ab6
[Core] Adding Priority Scheduling (#5958)
apatke Sep 25, 2024
6e0c9d6
[Bugfix] Use heartbeats instead of health checks (#8583)
joerunde Sep 25, 2024
ee777d9
Fix test_schedule_swapped_simple in test_scheduler.py (#8780)
sroy745 Sep 25, 2024
b452247
[Bugfix][Kernel] Implement acquire/release polyfill for Pascal (#8776)
sasha0552 Sep 25, 2024
fc3afc2
Fix tests in test_chunked_prefill_scheduler which fail with BlockMana…
sroy745 Sep 25, 2024
e3dd069
[BugFix] Propagate 'trust_remote_code' setting in internvl and minicp…
zifeitong Sep 25, 2024
c239536
[Hardware][CPU] Enable mrope and support Qwen2-VL on CPU backend (#8770)
Isotr0py Sep 25, 2024
3e073e6
[Bugfix] load fc bias from config for eagle (#8790)
sohamparikh Sep 25, 2024
1ac3de0
[Frontend] OpenAI server: propagate usage accounting to FastAPI middl…
agt Sep 25, 2024
3368c3a
[Bugfix] Ray 2.9.x doesn't expose available_resources_per_node (#8767)
darthhexx Sep 25, 2024
8fae5ed
[Misc] Fix minor typo in scheduler (#8765)
wooyeonlee0 Sep 25, 2024
1c04644
[CI/Build][Bugfix][Doc][ROCm] CI fix and doc update after ROCm 6.2 up…
hongxiayang Sep 25, 2024
300da09
[Kernel] Fullgraph and opcheck tests (#8479)
bnellnm Sep 25, 2024
c6f2485
[Misc] Add extra deps for openai server image (#8792)
jeejeelee Sep 25, 2024
0c4d2ad
[VLM][Bugfix] internvl with num_scheduler_steps > 1 (#8614)
DefTruth Sep 25, 2024
28e1299
rename PromptInputs and inputs with backward compatibility (#8760)
DarkLight1337 Sep 25, 2024
64840df
[Frontend] MQLLMEngine supports profiling. (#8761)
Abatom Sep 25, 2024
873edda
[Misc] Support FP8 MoE for compressed-tensors (#8588)
mgoin Sep 25, 2024
4f1ba08
Revert "rename PromptInputs and inputs with backward compatibility (#…
simon-mo Sep 25, 2024
770ec60
[Model] Add support for the multi-modal Llama 3.2 model (#8811)
heheda12345 Sep 25, 2024
e2c6e0a
[Doc] Update doc for Transformers 4.45 (#8817)
ywang96 Sep 25, 2024
7193774
[Misc] Support quantization of MllamaForCausalLM (#8822)
mgoin Sep 25, 2024
bfa692e
chore: add fork OWNERS
z103cb Apr 30, 2024
b96ffe3
add ubi Dockerfile
dtrifiro May 21, 2024
acbab07
Dockerfile.ubi: remove references to grpc/protos
dtrifiro May 21, 2024
d54bfce
Dockerfile.ubi: use vllm-tgis-adapter
dtrifiro May 28, 2024
8065d82
gha: add sync workflow
dtrifiro Jun 3, 2024
a5b5eb0
Dockerfile.ubi: use distributed-executor-backend=mp as default
dtrifiro Jun 10, 2024
cab1bac
Dockerfile.ubi: remove vllm-nccl workaround
dtrifiro Jun 13, 2024
7c65254
Dockerfile.ubi: add missing requirements-*.txt bind mounts
dtrifiro Jun 18, 2024
9f11204
add triton CustomCacheManger
tdoublep May 29, 2024
8914a32
gha: sync-with-upstream workflow create PRs as draft
dtrifiro Jun 19, 2024
1f8b826
add smoke/unit tests scripts
dtrifiro Jun 19, 2024
1beb801
extras: exit unit tests on err
dtrifiro Jun 20, 2024
b102823
Dockerfile.ubi: misc improvements
dtrifiro May 28, 2024
c5e1313
update OWNERS
dtrifiro Jun 21, 2024
ff1cc50
Dockerfile.ubi: use tensorizer (#64)
prashantgupta24 Jun 25, 2024
129720a
Dockerfile.ubi: pin vllm-tgis-adapter to 0.1.2
dtrifiro Jun 26, 2024
ee779e6
gha: fix fetch step in upstream sync workflow
dtrifiro Jul 2, 2024
160ddb8
gha: always update sync workflow PR body/title
dtrifiro Jul 2, 2024
2478277
Dockerfile.ubi: bump vllm-tgis-adapter to 0.1.3
dtrifiro Jul 3, 2024
9dc4dd3
Dockerfile.ubi: get rid of --distributed-executor-backend=mp
dtrifiro Jul 10, 2024
96b598b
Dockerfile.ubi: add flashinfer
dtrifiro Jul 9, 2024
4d6fd09
pin adapter to 2.0.0
prashantgupta24 Jul 12, 2024
ef7738d
deps: bump flashinfer to 0.0.9
dtrifiro Jul 15, 2024
3972a7d
Update OWNERS with IBM folks
heyselbi Jun 27, 2024
9bac7b9
Dockerfile.ubi: bind mount .git dir to allow inclusion of git commit …
dtrifiro Jul 17, 2024
97d24e4
gha: remove reminder_comment
dtrifiro Jul 17, 2024
c4aa1e3
Dockerfile: bump vllm-tgis-adapter to 0.2.1
dtrifiro Jul 18, 2024
e856dd3
fix: update setup.py to differentiate between fork and upstream
nathan-weinberg Jul 18, 2024
8ac5afb
Dockerfile.ubi: properly mount .git dir
dtrifiro Jul 19, 2024
58cbebb
Revert "[CI/Build] fix: update setup.py to differentiate between fork…
dtrifiro Jul 19, 2024
d5373dd
Dockerfile.ubi: bump vllm-tgis-adapter to 0.2.2
dtrifiro Jul 19, 2024
013813d
gha: remove unused upstream workflows
dtrifiro Jul 23, 2024
53f9489
deps: bump vllm-tgis-adapter to 0.2.3
dtrifiro Jul 24, 2024
0a20a57
Dockerfile.ubi: get rid of custom cache manager
dtrifiro Jul 24, 2024
d67d8f7
Dockerfile.ubi: add missing dependency
dtrifiro Aug 6, 2024
dfe980d
deps: bump vllm-tgis-adapter to 0.3.0
dtrifiro Jul 24, 2024
c002b3f
Dockerfile.ubi: force using python-installed cuda runtime libraries
dtrifiro Aug 12, 2024
857e618
Dockerfile: use uv pip everywhere (it's faster)
dtrifiro Aug 12, 2024
75adb8a
Dockerfile.ubi: bump flashinfer to 0.1.2
dtrifiro Aug 5, 2024
ce5c1bb
feat: allow long max seq length
tjohnson31415 Aug 8, 2024
94625bd
smoke test: kill server on timeout
dtrifiro Aug 13, 2024
5c90a8b
Dockerfile.ubi: set vllm_tgis_adapter uvicorn log level to warning
dtrifiro Aug 13, 2024
fe77683
fix: enable logprobs during spec decoding by default
tjohnson31415 Aug 20, 2024
3c5d24c
deps: bump vllm-tgis-adapter to 0.4.0 (#132)
vaibhavjainwiz Aug 21, 2024
a24dfae
Disable usage tracking
stevegrubb Aug 29, 2024
6acc54f
Start by updating the image
stevegrubb Sep 4, 2024
6d60952
Update ROCm build for UBI
Xaenalt Sep 3, 2024
d218479
Add sample chat template into vLLM container
vaibhavjainwiz Sep 10, 2024
c7872e9
Harden build of libsodium
stevegrubb Aug 27, 2024
09d0994
Update Dockerfile.ubi
RH-steve-grubb Sep 4, 2024
9041883
Update OWNERS file
vaibhavjainwiz Sep 16, 2024
d3f06b5
Dockerfile.rocm.ubi: cleanup
dtrifiro Sep 6, 2024
3cc6ea4
add vllm-tgis-adapter layer
dtrifiro Sep 11, 2024
b30f9ed
Dockerfile.ubi: bump python to 3.12
dtrifiro Sep 12, 2024
8238489
Dockerfile.ubi: bump flashinfer to 0.1.6
dtrifiro Sep 12, 2024
67080c2
Dockerfile.rocm.ubi: do not use nightly pytorch_triton
dtrifiro Sep 16, 2024
010c1bd
Dockerfile.ubi: fix PYTHON_VERSION arg usage
dtrifiro Sep 17, 2024
f91432f
Dockerfile.rocm.ubi: move microdnf update in base stage
dtrifiro Sep 25, 2024
b1b179f
Dockerfile.rocm.ubi: bump torch version to 2.5.0.dev20240912+rocm6.1
dtrifiro Sep 25, 2024
cd4b748
Dockerfile.rocm.ubi: get rid of build triton stage
dtrifiro Sep 25, 2024
38247ba
Dockerfile.rocm.ubi: add setuptools-scm build dependency
dtrifiro Sep 26, 2024
451470c
Dockerfile.ubi: add VLLM_FA_CMAKE_GPU_ARCHES
dtrifiro Sep 26, 2024
81a8400
Dockerfile.ubi: use COPY . . to make repo available when building wheel
dtrifiro Sep 27, 2024
ec1f663
This sets the vllm build to a Release build type, builds libsodium,
stevegrubb Sep 25, 2024
d16bf47
Move libsodium install
stevegrubb Sep 26, 2024
f63fbdd
Correct logging directory permissions
stevegrubb Sep 26, 2024
902985d
bump tgis adapter to v0.5.0
NickLucche Sep 27, 2024
9d9bb9c
Merge branch 'main' into sync_release
vaibhavjainwiz Sep 27, 2024
35 changes: 21 additions & 14 deletions .buildkite/check-wheel-size.py
@@ -1,36 +1,43 @@
 import os
+import sys
 import zipfile
 
-MAX_SIZE_MB = 250
+# Read the VLLM_MAX_SIZE_MB environment variable, defaulting to 250 MB
+VLLM_MAX_SIZE_MB = int(os.environ.get('VLLM_MAX_SIZE_MB', 250))
 
 
 def print_top_10_largest_files(zip_file):
+    """Print the top 10 largest files in the given zip file."""
     with zipfile.ZipFile(zip_file, 'r') as z:
         file_sizes = [(f, z.getinfo(f).file_size) for f in z.namelist()]
         file_sizes.sort(key=lambda x: x[1], reverse=True)
         for f, size in file_sizes[:10]:
-            print(f"{f}: {size/(1024*1024)} MBs uncompressed.")
+            print(f"{f}: {size / (1024 * 1024):.2f} MBs uncompressed.")
 
 
 def check_wheel_size(directory):
+    """Check the size of .whl files in the given directory."""
     for root, _, files in os.walk(directory):
-        for f in files:
-            if f.endswith(".whl"):
-                wheel_path = os.path.join(root, f)
-                wheel_size = os.path.getsize(wheel_path)
-                wheel_size_mb = wheel_size / (1024 * 1024)
-                if wheel_size_mb > MAX_SIZE_MB:
-                    print(
-                        f"Wheel {wheel_path} is too large ({wheel_size_mb} MB) "
-                        f"compare to the allowed size ({MAX_SIZE_MB} MB).")
+        for file_name in files:
+            if file_name.endswith(".whl"):
+                wheel_path = os.path.join(root, file_name)
+                wheel_size_mb = os.path.getsize(wheel_path) / (1024 * 1024)
+                if wheel_size_mb > VLLM_MAX_SIZE_MB:
+                    print(f"Not allowed: Wheel {wheel_path} is larger "
+                          f"({wheel_size_mb:.2f} MB) than the limit "
+                          f"({VLLM_MAX_SIZE_MB} MB).")
                     print_top_10_largest_files(wheel_path)
                     return 1
                 else:
                     print(f"Wheel {wheel_path} is within the allowed size "
-                          f"({wheel_size_mb} MB).")
+                          f"({wheel_size_mb:.2f} MB).")
     return 0
 
 
 if __name__ == "__main__":
-    import sys
-    sys.exit(check_wheel_size(sys.argv[1]))
+    if len(sys.argv) < 2:
+        print("Usage: python check-wheel-size.py <directory>")
+        sys.exit(1)
+
+    directory = sys.argv[1]
+    sys.exit(check_wheel_size(directory))
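The new environment variable makes the size limit tunable per build instead of hard-coded. A minimal local smoke test of the updated script (a sketch; assumes a `dist/` directory containing a built wheel):

```bash
# Check wheels under dist/ against the default 250 MB limit
python .buildkite/check-wheel-size.py dist/

# Raise the limit to 300 MB via the new environment variable
VLLM_MAX_SIZE_MB=300 python .buildkite/check-wheel-size.py dist/

# With no argument, the script now prints usage and exits with status 1
python .buildkite/check-wheel-size.py || echo "exit code: $?"
```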
1 change: 0 additions & 1 deletion .buildkite/lm-eval-harness/configs/models-small.txt
@@ -1,5 +1,4 @@
Meta-Llama-3-8B-Instruct.yaml
Meta-Llama-3-8B-Instruct-FP8.yaml
Meta-Llama-3-8B-Instruct-FP8-compressed-tensors.yaml
Meta-Llama-3-8B-Instruct-INT8-compressed-tensors.yaml
Meta-Llama-3-8B-Instruct-nonuniform-compressed-tensors.yaml
3 changes: 1 addition & 2 deletions .buildkite/nightly-benchmarks/benchmark-pipeline.yaml
@@ -8,8 +8,7 @@ steps:
           containers:
           - image: badouralix/curl-jq
             command:
-            - sh
-            - .buildkite/nightly-benchmarks/scripts/wait-for-image.sh
+            - sh .buildkite/nightly-benchmarks/scripts/wait-for-image.sh
   - wait
   - label: "A100"
     agents:
4 changes: 3 additions & 1 deletion .buildkite/nightly-benchmarks/scripts/wait-for-image.sh
@@ -2,9 +2,11 @@
 TOKEN=$(curl -s -L "https://public.ecr.aws/token?service=public.ecr.aws&scope=repository:q9t5s3a7/vllm-ci-test-repo:pull" | jq -r .token)
 URL="https://public.ecr.aws/v2/q9t5s3a7/vllm-ci-test-repo/manifests/$BUILDKITE_COMMIT"
 
+TIMEOUT_SECONDS=10
+
 retries=0
 while [ $retries -lt 1000 ]; do
-    if [ $(curl -s -L -H "Authorization: Bearer $TOKEN" -o /dev/null -w "%{http_code}" $URL) -eq 200 ]; then
+    if [ $(curl -s --max-time $TIMEOUT_SECONDS -L -H "Authorization: Bearer $TOKEN" -o /dev/null -w "%{http_code}" $URL) -eq 200 ]; then
         exit 0
     fi
 
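The notable change here is `--max-time`: without it, a hung registry connection could stall one of the 1000 retry iterations indefinitely. A standalone sketch of the pattern (the URL is hypothetical, for illustration only):

```bash
# Cap each attempt at 10 seconds so a stalled connection fails fast
# and the surrounding retry loop keeps making progress.
status=$(curl -s --max-time 10 -L -o /dev/null -w "%{http_code}" "https://example.com/health")
[ "$status" -eq 200 ] && echo "image ready" || echo "not ready yet (HTTP $status)"
```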
80 changes: 75 additions & 5 deletions .buildkite/run-amd-test.sh
100644 → 100755
@@ -1,5 +1,5 @@
 # This script runs test inside the corresponding ROCm docker container.
-set -ex
+set -o pipefail
 
 # Print ROCm version
 echo "--- Confirming Clean Initial State"
@@ -70,15 +70,85 @@ HF_CACHE="$(realpath ~)/huggingface"
 mkdir -p ${HF_CACHE}
 HF_MOUNT="/root/.cache/huggingface"
 
-docker run \
+commands=$@
+echo "Commands:$commands"
+#ignore certain kernels tests
+if [[ $commands == *" kernels "* ]]; then
+  commands="${commands} \
+  --ignore=kernels/test_attention.py \
+  --ignore=kernels/test_attention_selector.py \
+  --ignore=kernels/test_blocksparse_attention.py \
+  --ignore=kernels/test_causal_conv1d.py \
+  --ignore=kernels/test_cutlass.py \
+  --ignore=kernels/test_encoder_decoder_attn.py \
+  --ignore=kernels/test_flash_attn.py \
+  --ignore=kernels/test_flashinfer.py \
+  --ignore=kernels/test_gguf.py \
+  --ignore=kernels/test_int8_quant.py \
+  --ignore=kernels/test_machete_gemm.py \
+  --ignore=kernels/test_mamba_ssm.py \
+  --ignore=kernels/test_marlin_gemm.py \
+  --ignore=kernels/test_moe.py \
+  --ignore=kernels/test_prefix_prefill.py \
+  --ignore=kernels/test_rand.py \
+  --ignore=kernels/test_sampler.py"
+fi
+
+#ignore certain Entrypoints tests
+if [[ $commands == *" entrypoints/openai "* ]]; then
+  commands=${commands//" entrypoints/openai "/" entrypoints/openai \
+  --ignore=entrypoints/openai/test_accuracy.py \
+  --ignore=entrypoints/openai/test_audio.py \
+  --ignore=entrypoints/openai/test_encoder_decoder.py \
+  --ignore=entrypoints/openai/test_embedding.py \
+  --ignore=entrypoints/openai/test_oot_registration.py "}
+fi
+
+PARALLEL_JOB_COUNT=8
+# check if the command contains shard flag, we will run all shards in parallel because the host have 8 GPUs.
+if [[ $commands == *"--shard-id="* ]]; then
+  for GPU in $(seq 0 $(($PARALLEL_JOB_COUNT-1))); do
+    #replace shard arguments
+    commands=${commands//"--shard-id= "/"--shard-id=${GPU} "}
+    commands=${commands//"--num-shards= "/"--num-shards=${PARALLEL_JOB_COUNT} "}
+    echo "Shard ${GPU} commands:$commands"
+    docker run \
         --device /dev/kfd --device /dev/dri \
         --network host \
         --shm-size=16gb \
         --rm \
+        -e HIP_VISIBLE_DEVICES=${GPU} \
         -e HF_TOKEN \
         -v ${HF_CACHE}:${HF_MOUNT} \
         -e HF_HOME=${HF_MOUNT} \
-        --name ${container_name} \
+        --name ${container_name}_${GPU} \
         ${image_name} \
-        /bin/bash -c "${@}"
-
+        /bin/bash -c "${commands}" \
+        |& while read -r line; do echo ">>Shard $GPU: $line"; done &
+    PIDS+=($!)
+  done
+  #wait for all processes to finish and collect exit codes
+  for pid in ${PIDS[@]}; do
+    wait ${pid}
+    STATUS+=($?)
+  done
+  for st in ${STATUS[@]}; do
+    if [[ ${st} -ne 0 ]]; then
+      echo "One of the processes failed with $st"
+      exit ${st}
+    fi
+  done
+else
+  docker run \
+        --device /dev/kfd --device /dev/dri \
+        --network host \
+        --shm-size=16gb \
+        --rm \
+        -e HIP_VISIBLE_DEVICES=0 \
+        -e HF_TOKEN \
+        -v ${HF_CACHE}:${HF_MOUNT} \
+        -e HF_HOME=${HF_MOUNT} \
+        --name ${container_name} \
+        ${image_name} \
+        /bin/bash -c "${commands}"
+fi
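The shard fan-out above leans on bash pattern substitution (`${var//pattern/replacement}`) to fill in the per-GPU shard arguments. A self-contained sketch of just that mechanism (the test command is hypothetical; note the sketch substitutes into a copy, so the empty placeholder remains available for the next iteration):

```bash
#!/bin/bash
# Template command with empty placeholder values, as CI would pass it in
commands="pytest -v tests/ --shard-id= --num-shards= "
PARALLEL_JOB_COUNT=8

for GPU in $(seq 0 $((PARALLEL_JOB_COUNT - 1))); do
    # Fill the placeholders for this shard; writing into shard_cmd keeps
    # the "--shard-id= " pattern in $commands intact for later iterations.
    shard_cmd=${commands//"--shard-id= "/"--shard-id=${GPU} "}
    shard_cmd=${shard_cmd//"--num-shards= "/"--num-shards=${PARALLEL_JOB_COUNT} "}
    echo "Shard ${GPU}: ${shard_cmd}"
done
```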
33 changes: 33 additions & 0 deletions .buildkite/run-cpu-test-ppc64le.sh
@@ -0,0 +1,33 @@
# This script build the CPU docker image and run the offline inference inside the container.
# It serves a sanity check for compilation and basic model usage.
set -ex

# Try building the docker image
docker build -t cpu-test -f Dockerfile.ppc64le .

# Setup cleanup
remove_docker_container() { docker rm -f cpu-test || true; }
trap remove_docker_container EXIT
remove_docker_container

# Run the image, setting --shm-size=4g for tensor parallel.
source /etc/environment
#docker run -itd --entrypoint /bin/bash -v ~/.cache/huggingface:/root/.cache/huggingface --privileged=true --network host -e HF_TOKEN --env VLLM_CPU_KVCACHE_SPACE=4 --shm-size=4g --name cpu-test cpu-test
docker run -itd --entrypoint /bin/bash -v ~/.cache/huggingface:/root/.cache/huggingface --privileged=true --network host -e HF_TOKEN=$HF_TOKEN --name cpu-test cpu-test

# Run basic model test
docker exec cpu-test bash -c "
pip install pytest matplotlib einops transformers_stream_generator
pytest -v -s tests/models -m \"not vlm\" --ignore=tests/models/test_embedding.py --ignore=tests/models/test_oot_registration.py --ignore=tests/models/test_registry.py --ignore=tests/models/test_jamba.py --ignore=tests/models/test_danube3_4b.py" # Mamba and Danube3-4B on CPU is not supported

# online inference
docker exec cpu-test bash -c "
python3 -m vllm.entrypoints.openai.api_server --model facebook/opt-125m &
timeout 600 bash -c 'until curl localhost:8000/v1/models; do sleep 1; done' || exit 1
python3 benchmarks/benchmark_serving.py \
--backend vllm \
--dataset-name random \
--model facebook/opt-125m \
--num-prompts 20 \
--endpoint /v1/completions \
--tokenizer facebook/opt-125m"
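One idiom in this new script worth calling out: the online-inference block starts the server in the background, then polls it with a bounded wait. A minimal sketch of that readiness check (assuming the server's default port 8000):

```bash
# Launch the OpenAI-compatible server in the background...
python3 -m vllm.entrypoints.openai.api_server --model facebook/opt-125m &

# ...then block until it responds, giving up after 600 seconds.
# `timeout` kills the polling loop on expiry, and `|| exit 1`
# propagates the failure to the CI job instead of hanging forever.
timeout 600 bash -c 'until curl -s localhost:8000/v1/models; do sleep 1; done' || exit 1
```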
13 changes: 11 additions & 2 deletions .buildkite/run-cpu-test.sh
@@ -22,8 +22,17 @@ docker exec cpu-test-avx2 bash -c "python3 examples/offline_inference.py"
 
 # Run basic model test
 docker exec cpu-test bash -c "
-  pip install pytest matplotlib einops transformers_stream_generator
-  pytest -v -s tests/models -m \"not vlm\" --ignore=tests/models/test_embedding.py --ignore=tests/models/test_oot_registration.py --ignore=tests/models/test_registry.py --ignore=tests/models/test_jamba.py --ignore=tests/models/test_danube3_4b.py" # Mamba and Danube3-4B on CPU is not supported
+  pip install pytest matplotlib einops transformers_stream_generator datamodel_code_generator
+  pytest -v -s tests/models/decoder_only/language \
+    --ignore=tests/models/test_fp8.py \
+    --ignore=tests/models/decoder_only/language/test_jamba.py \
+    --ignore=tests/models/decoder_only/language/test_danube3_4b.py" # Mamba and Danube3-4B on CPU is not supported
+
+# Run compressed-tensor test
+docker exec cpu-test bash -c "
+  pytest -s -v \
+  tests/quantization/test_compressed_tensors.py::test_compressed_tensors_w8a8_static_setup \
+  tests/quantization/test_compressed_tensors.py::test_compressed_tensors_w8a8_dynanmic_per_token"
 
 # online inference
 docker exec cpu-test bash -c "
3 changes: 1 addition & 2 deletions .buildkite/run-tpu-test.sh
@@ -12,5 +12,4 @@ remove_docker_container
 # For HF_TOKEN.
 source /etc/environment
 # Run a simple end-to-end example.
-docker run --privileged --net host --shm-size=16G -it -e HF_TOKEN=$HF_TOKEN --name tpu-test vllm-tpu \
-    python3 /workspace/vllm/examples/offline_inference_tpu.py
+docker run --privileged --net host --shm-size=16G -it -e HF_TOKEN=$HF_TOKEN --name tpu-test vllm-tpu /bin/bash -c "python3 -m pip install git+https://github.com/thuml/depyf.git && python3 -m pip install pytest && pytest -v -s /workspace/vllm/tests/tpu/test_custom_dispatcher.py && python3 /workspace/vllm/tests/tpu/test_compilation.py && python3 /workspace/vllm/examples/offline_inference_tpu.py"