Closed

616 commits
b553e05
[Misc] Exclude the `tests` directory from being packaged (#4552)
itechbear May 2, 2024
37f8957
[BugFix] Include target-device specific requirements.txt in sdist (#4…
markmc May 2, 2024
d7f5c58
[Misc] centralize all usage of environment variables (#4548)
youkaichao May 2, 2024
df04c10
[kernel] fix sliding window in prefix prefill Triton kernel (#4405)
mmoskal May 2, 2024
299066f
[CI/Build] AMD CI pipeline with extended set of tests. (#4267)
Alexei-V-Ivanov-AMD May 2, 2024
3e9f425
[Core] Ignore infeasible swap requests. (#4557)
rkooo567 May 2, 2024
977a6cd
[Core][Distributed] enable allreduce for multiple tp groups (#4566)
youkaichao May 3, 2024
de6d42a
[BugFix] Prevent the task of `_force_log` from being garbage collecte…
Atry May 3, 2024
deb0ccc
[Misc] remove chunk detected debug logs (#4571)
DefTruth May 3, 2024
9500596
[Doc] add env vars to the doc (#4572)
youkaichao May 3, 2024
a5d0d0e
[Core][Model runner refactoring 1/N] Refactor attn metadata term (#4518)
rkooo567 May 3, 2024
ab445b1
[Bugfix] Allow "None" or "" to be passed to CLI for string args that …
mgoin May 3, 2024
83f0437
Fix/async chat serving (#2727)
schoennenbeck May 3, 2024
0c86070
[Kernel] Use flashinfer for decoding (#4353)
LiuXiaoxuanPKU May 3, 2024
81a9e09
[Speculative decoding] Support target-model logprobs (#4378)
cadedaniel May 3, 2024
cf0665c
[Misc] add installation time env vars (#4574)
youkaichao May 3, 2024
ecb55eb
[Misc][Refactor] Introduce ExecuteModelData (#4540)
comaniac May 4, 2024
8e82b90
[Doc] Chunked Prefill Documentation (#4580)
rkooo567 May 4, 2024
ba2be94
[Kernel] Support MoE Fp8 Checkpoints for Mixtral (Static Weights with…
mgoin May 4, 2024
71bb251
[CI] check size of the wheels (#4319)
simon-mo May 4, 2024
ac5ccb6
[Bugfix] Fix inappropriate content of model_name tag in Prometheus me…
DearPlanet May 4, 2024
52b5bcb
bump version to v0.4.2 (#4600)
simon-mo May 5, 2024
c7426c1
[CI] Reduce wheel size by not shipping debug symbols (#4602)
simon-mo May 5, 2024
352ef7c
Disable cuda version check in vllm-openai image (#4530)
zhaoyang-star May 5, 2024
06241cf
[Bugfix] Fix `asyncio.Task` not being subscriptable (#4623)
DarkLight1337 May 6, 2024
4c758aa
Update vLLM to 323f27b9
joerunde May 6, 2024
2caabff
format: make mypy happy (#24)
tjohnson31415 May 8, 2024
c737a7a
ci/build/feat: bump vLLM libs to v0.4.2 and other deps in Dockerfile.…
tjohnson31415 May 8, 2024
06d9876
TGISStatLogger: fix stats usage (#25)
tjohnson31415 May 8, 2024
21fb852
fix: use vllm_nccl installed nccl version (#26)
tjohnson31415 May 13, 2024
2e81ed2
:bug: fix prometheus metric labels (#27)
joerunde May 14, 2024
79dce26
[CI] use ccache actions properly in release workflow (#4629)
simon-mo May 6, 2024
d363d39
[CI] Add retry for agent lost (#4633)
cadedaniel May 6, 2024
a547717
Update lm-format-enforcer to 0.10.1 (#4631)
noamgat May 6, 2024
3798adb
[Kernel] Make static FP8 scaling more robust (#4570)
pcmoritz May 7, 2024
73323c3
[Core][Optimization] change python dict to pytorch tensor (#4607)
youkaichao May 7, 2024
ffc7024
[Build/CI] Fixing 'docker run' to re-enable AMD CI tests. (#4642)
Alexei-V-Ivanov-AMD May 7, 2024
07ccdeb
[Bugfix] Fixed error in slice_lora_b for MergedQKVParallelLinearWithL…
FurtherAI May 7, 2024
4fb77a9
[Core][Optimization] change copy-on-write from dict[int, list] to lis…
youkaichao May 7, 2024
7088e42
[Bug fix][Core] fixup ngram not setup correctly (#4551)
leiwen83 May 7, 2024
1571342
[Core][Distributed] support cpu&device in broadcast tensor dict (#4660)
youkaichao May 8, 2024
9e4b2e2
[Core] Optimize sampler get_logprobs (#4594)
rkooo567 May 8, 2024
e7ebde1
[CI] Make mistral tests pass (#4596)
rkooo567 May 8, 2024
1bb5e89
[Bugfix][Kernel] allow non-power-of-2 for prefix prefill with alibi …
DefTruth May 8, 2024
456bcbc
[Misc] Add `get_name` method to attention backends (#4685)
WoosukKwon May 8, 2024
5aedfe8
[Core] Faster startup for LoRA enabled models (#4634)
Yard1 May 8, 2024
2563537
[Core][Optimization] change python dict to pytorch tensor for blocks …
youkaichao May 8, 2024
a696be1
[CI/Test] fix swap test for multi gpu (#4689)
youkaichao May 8, 2024
fe03b5c
[Misc] Use vllm-flash-attn instead of flash-attn (#4686)
WoosukKwon May 8, 2024
4c17d62
[Dynamic Spec Decoding] Auto-disable by the running queue size (#4592)
comaniac May 8, 2024
683a105
[Speculative decoding] [Bugfix] Fix overallocation in ngram + spec lo…
cadedaniel May 8, 2024
53a9503
[Bugfix] Fine-tune gptq_marlin configs to be more similar to marlin (…
alexm-redhat May 9, 2024
d6eb999
[Frontend] add tok/s speed metric to llm class when using tqdm (#4400)
MahmoudAshraf97 May 9, 2024
b346a6d
[Frontend] Move async logic outside of constructor (#4674)
DarkLight1337 May 9, 2024
8427be7
[Misc] Remove unnecessary ModelRunner imports (#4703)
WoosukKwon May 9, 2024
e5b181e
[Misc] Set block size at initialization & Fix test_model_runner (#4705)
WoosukKwon May 9, 2024
0a838de
[ROCm] Add support for Punica kernels on AMD GPUs (#3140)
kliuae May 9, 2024
b4214c5
[Bugfix] Fix CLI arguments in OpenAI server docs (#4709)
DarkLight1337 May 9, 2024
df54be8
[Bugfix] Update grafana.json (#4711)
robertgshaw2-redhat May 9, 2024
475b9a0
[Bugfix] Add logs for all model dtype casting (#4717)
mgoin May 9, 2024
12d23f9
[Model] Snowflake arctic model implementation (#4652)
sfc-gh-hazhang May 9, 2024
d7e6b3f
[Kernel] [FP8] Improve FP8 linear layer performance (#4691)
pcmoritz May 9, 2024
439c463
[Kernel] Refactor FP8 kv-cache with NVIDIA float8_e4m3 support (#4535)
comaniac May 10, 2024
ce0f149
[Core][Distributed] refactor pynccl (#4591)
youkaichao May 10, 2024
8cf6b87
[Misc] Keep only one implementation of the create_dummy_prompt functi…
AllenDou May 10, 2024
bd873f4
chunked-prefill-doc-syntax (#4603)
simon-mo May 10, 2024
4b0058f
[Core]fix type annotation for `swap_blocks` (#4726)
jikunshang May 10, 2024
bee64c4
[Misc] Apply a couple g++ cleanups (#4719)
stevegrubb May 10, 2024
3363a6b
[Core] Fix circular reference which leaked llm instance in local dev …
rkooo567 May 10, 2024
c56ae80
[Bugfix] Fix CLI arguments in OpenAI server docs (#4729)
AllenDou May 10, 2024
3498e74
[Speculative decoding] CUDA graph support (#4295)
heeju-kim2 May 10, 2024
fffb10a
[CI] Nits for bad initialization of SeqGroup in testing (#4748)
robertgshaw2-redhat May 10, 2024
cd8f90f
[Core][Test] fix function name typo in custom allreduce (#4750)
youkaichao May 10, 2024
70fa8fd
[Model][Misc] Add e5-mistral-7b-instruct and Embedding API (#3734)
CatherineSue May 11, 2024
e2302f4
[Model] Add support for IBM Granite Code models (#4636)
yikangshen May 12, 2024
3942ef1
[CI/Build] Tweak Marlin Nondeterminism Issues (#4713)
robertgshaw2-redhat May 13, 2024
6410635
[CORE] Improvement in ranks code (#4718)
SwapnilDreams100 May 13, 2024
0493233
[Core][Distributed] refactor custom allreduce to support multiple tp …
youkaichao May 13, 2024
35a3273
[CI/Build] Move `test_utils.py` to `tests/utils.py` (#4425)
DarkLight1337 May 13, 2024
7a0a670
[Scheduler] Warning upon preemption and Swapping (#4647)
rkooo567 May 13, 2024
98d62a2
[Misc] Enhance attention selector (#4751)
WoosukKwon May 13, 2024
64d2fdc
[Frontend] [Core] perf: Automatically detect vLLM-tensorized model, u…
sangstar May 13, 2024
47f50c5
[Speculative decoding] Improve n-gram efficiency (#4724)
comaniac May 13, 2024
28c395f
[Kernel] Use flash-attn for decoding (#3648)
skrider May 13, 2024
95411c6
[Bugfix] Fix dynamic FP8 quantization for Mixtral (#4793)
pcmoritz May 13, 2024
f4270f2
[Doc] Shorten README by removing supported model list (#4796)
zhuohan123 May 13, 2024
c75ceb4
[Doc] Add API reference for offline inference (#4710)
DarkLight1337 May 14, 2024
0e5d2a9
[Doc] Add meetups to the doc (#4798)
zhuohan123 May 14, 2024
ed2d743
[Core][Hash][Automatic Prefix caching] Accelerating the hashing funct…
KuntaiDu May 14, 2024
929ecdc
[Bugfix][Doc] Fix CI failure in docs (#4804)
DarkLight1337 May 14, 2024
3008471
[Core] Add MultiprocessingGPUExecutor (#4539)
njhill May 14, 2024
73a4168
Add 4th meetup announcement to readme (#4817)
simon-mo May 14, 2024
a69f3af
Revert "[Kernel] Use flash-attn for decoding (#3648)" (#4820)
rkooo567 May 15, 2024
71cd938
[Core][2/N] Model runner refactoring part 2. Combine prepare prefill …
rkooo567 May 15, 2024
6d46185
[CI/Build] Further decouple HuggingFace implementation from ours duri…
DarkLight1337 May 15, 2024
4e0ddd9
[Bugfix] Properly set distributed_executor_backend in ParallelConfig …
zifeitong May 15, 2024
7df0a0b
[Doc] Highlight the fourth meetup in the README (#4842)
zhuohan123 May 15, 2024
e9ddce5
[Frontend] Re-enable custom roles in Chat Completions API (#4758)
DarkLight1337 May 15, 2024
7c731a9
[Frontend] Support OpenAI batch file format (#4794)
wuisawesome May 15, 2024
9002ba4
[Core] Implement sharded state loader (#4690)
aurickq May 16, 2024
f832b56
[Speculative decoding][Re-take] Enable TP>1 speculative decoding (#4840)
comaniac May 16, 2024
117f7b4
Add marlin unit tests and marlin benchmark script (#4815)
alexm-redhat May 16, 2024
7bc509e
[Kernel] add bfloat16 support for gptq marlin kernel (#4788)
jinzhen-lin May 16, 2024
05afbc4
[docs] Fix typo in examples filename openi -> openai (#4864)
wuisawesome May 16, 2024
982a80b
[Frontend] Separate OpenAI Batch Runner usage from API Server (#4851)
wuisawesome May 16, 2024
ddcdb15
[Bugfix] Bypass authorization API token for preflight requests (#4862)
dulacp May 16, 2024
1b7a015
Add GPTQ Marlin 2:4 sparse structured support (#4790)
alexm-redhat May 16, 2024
b5a7ecd
Add JSON output support for benchmark_latency and benchmark_throughpu…
simon-mo May 16, 2024
7459ec4
[ROCm][AMD][Bugfix] adding a missing triton autotune config (#4845)
hongxiayang May 16, 2024
eb62283
[Core][Distributed] remove graph mode function (#4818)
youkaichao May 16, 2024
e1aad8a
[Misc] remove old comments (#4866)
youkaichao May 16, 2024
b551e55
[Kernel] Add punica dimension for Qwen1.5-32B LoRA (#4850)
Silencioo May 16, 2024
71b4283
Update vLLM to 8435b207
tjohnson31415 May 16, 2024
3ac6575
:bug: fixup merge conflicts
joerunde May 16, 2024
6b06bf0
:fire: remove flash attention
joerunde May 16, 2024
9499dce
Use fork for worker multiprocessing method (#29)
njhill May 16, 2024
0b16320
[Kernel] Add w8a8 CUTLASS kernels (#4749)
tlrmchlsmth May 16, 2024
d13ad85
[Bugfix] Fix FP8 KV cache support (#4869)
WoosukKwon May 16, 2024
fd979cb
Support to serve vLLM on Kubernetes with LWS (#4829)
kerthcet May 16, 2024
f6e95fb
[Frontend] OpenAI API server: Do not add bos token by default when en…
bofenghuang May 17, 2024
afcf4c8
[Build/CI] Extending the set of AMD tests with Regression, Basic Corr…
Alexei-V-Ivanov-AMD May 17, 2024
1f614f9
[Bugfix] fix rope error when load models with different dtypes (#4835)
jinzhen-lin May 17, 2024
d0f3b87
Sync huggingface modifications of qwen Moe model (#4774)
eigen2017 May 17, 2024
eab073d
[Doc] Update Ray Data distributed offline inference example (#4871)
Yard1 May 17, 2024
d7f076f
[Bugfix] Relax tiktoken to >= 0.6.0 (#4890)
mgoin May 17, 2024
94a7c8b
[ROCm][Hardware][AMD] Adding Navi21 to fallback to naive attention if…
akondrat-amd May 18, 2024
1102c61
[Lora] Support long context lora (#4787)
rkooo567 May 18, 2024
fc3cc45
[Bugfix][Model] Add base class for vision-language models (#4809)
DarkLight1337 May 19, 2024
41da12f
[Kernel] Add marlin_24 unit tests (#4901)
alexm-redhat May 19, 2024
fd1308b
[Kernel] Add flash-attn back (#4907)
WoosukKwon May 20, 2024
2cc299f
[Model] LLaVA model refactor (#4910)
DarkLight1337 May 20, 2024
c23600a
Remove marlin warning (#4918)
alexm-redhat May 20, 2024
f022464
Update vLLM to da5a0b53
joerunde May 20, 2024
066041a
:sparkles: log all errored requests (#30)
joerunde May 20, 2024
4af59d3
Apply temp. patch to Triton code to resolve conflicting cache dirs in…
tdoublep May 28, 2024
3dc2819
Add guided decoding to TGIS gRPC API (#31)
njhill May 30, 2024
24f4ff0
feat: install tensorizer in the UBI image (#36)
tjohnson31415 Jun 3, 2024
d7e8e8c
Small changes to support releases (#37)
joerunde Jun 3, 2024
dc42ed9
[Misc]: allow user to specify port in distributed setting (#4914)
ZwwWayne May 20, 2024
292c4ec
[Build/CI] Enabling AMD Entrypoints Test (#4834)
Alexei-V-Ivanov-AMD May 20, 2024
aa76008
[Bugfix] Fix dummy weight for fp8 (#4916)
mzusman May 20, 2024
2ec86bd
[Core] Sharded State Loader download from HF (#4889)
aurickq May 20, 2024
dac0039
[Doc]Add documentation to benchmarking script when running TGI (#4920)
KuntaiDu May 20, 2024
92d09c5
[Core] Fix scheduler considering "no LoRA" as "LoRA" (#4897)
Yard1 May 21, 2024
551315f
[Model] add rope_scaling support for qwen2 (#4930)
hzhwcmhf May 21, 2024
9a36a5b
[Model] Add Phi-2 LoRA support (#4886)
Isotr0py May 21, 2024
7099fae
[Docs] Add acknowledgment for sponsors (#4925)
simon-mo May 21, 2024
3d9eefc
[CI/Build] Codespell ignore `build/` directory (#4945)
mgoin May 21, 2024
8976308
[Bugfix] Fix flag name for `max_seq_len_to_capture` (#4935)
kerthcet May 21, 2024
4834f01
[Bugfix][Kernel] Add head size check for attention backend selection …
Isotr0py May 21, 2024
d002313
[Frontend] Dynamic RoPE scaling (#4638)
sasha0552 May 22, 2024
a0bf570
[CI/Build] Enforce style for C++ and CUDA code with `clang-format` (#…
mgoin May 22, 2024
81b685e
[misc] remove comments that were supposed to be removed (#4977)
rkooo567 May 22, 2024
56382f4
[Kernel] Fixup for CUTLASS kernels in CUDA graphs (#4954)
tlrmchlsmth May 22, 2024
0fc07da
[Misc] Load FP8 kv-cache scaling factors from checkpoints (#4893)
comaniac May 22, 2024
b9d9b01
[Model] LoRA gptbigcode implementation (#3949)
raywanb May 22, 2024
b76dccd
[Core] Eliminate parallel worker per-step task scheduling overhead (#…
njhill May 22, 2024
ca9af77
[Minor] Fix small typo in llama.py: QKVParallelLinear -> Quantization…
pcmoritz May 22, 2024
2b9d3b1
[Misc] Take user preference in attention selector (#4960)
comaniac May 22, 2024
4b7e7a2
Marlin 24 prefill performance improvement (about 25% better on averag…
alexm-redhat May 23, 2024
5fd2ffd
[Bugfix] Update Dockerfile.cpu to fix NameError: name 'vllm_ops' is n…
LetianLee May 23, 2024
b0e20f3
[Core][1/N] Support send/recv in PyNCCL Groups (#4988)
andoorve May 23, 2024
24bd163
[Kernel] Initial Activation Quantization Support (#4525)
dsikka May 23, 2024
1a2e192
[Core]: Option To Use Prompt Token Ids Inside Logits Processor (#4985)
kezouke May 23, 2024
099dfe4
[Doc] add ccache guide in doc (#5012)
youkaichao May 23, 2024
62c722b
[Bugfix] Fix Mistral v0.3 Weight Loading (#5005)
robertgshaw2-redhat May 24, 2024
b5a96fd
[Core][Bugfix]: fix prefix caching for blockv2 (#4764)
leiwen83 May 24, 2024
55f17cc
[Kernel][Backend][Model] Blocksparse flash attention kernel and Phi-3…
linxihui May 25, 2024
91fadd7
[Misc] add logging level env var (#5045)
youkaichao May 25, 2024
cf06cc8
[Dynamic Spec Decoding] Minor fix for disabling speculative decoding …
LiuXiaoxuanPKU May 25, 2024
2b78b55
[Misc] Make Serving Benchmark More User-friendly (#5044)
ywang96 May 25, 2024
c0dc88e
[Bugfix / Core] Prefix Caching Guards (merged with main) (#4846)
zhuohan123 May 27, 2024
f2ab617
[Core] Allow AQLM on Pascal (#5058)
sasha0552 May 27, 2024
eec5663
[Model] Add support for falcon-11B (#5069)
Isotr0py May 27, 2024
1d8a686
[Core] Sliding window for block manager v2 (#4545)
mmoskal May 28, 2024
541ccac
[BugFix] Fix Embedding Models with TP>1 (#5075)
robertgshaw2-redhat May 28, 2024
c0a626a
[Kernel][ROCm][AMD] Add fused_moe Triton configs for MI300X (#4951)
divakar-amd May 28, 2024
9775a22
[Docs] Add Dropbox as sponsors (#5089)
simon-mo May 28, 2024
85ed9d0
[Core] Consolidate prompt arguments to LLM engines (#4328)
DarkLight1337 May 28, 2024
213c7d0
[Bugfix] Remove the last EOS token unless explicitly specified (#5077)
jsato8094 May 29, 2024
264bbf0
[Misc] add gpu_memory_utilization arg (#5079)
pandyamarut May 29, 2024
53b9e4c
[Core][Optimization] remove vllm-nccl (#5091)
youkaichao May 29, 2024
08cdcfc
[Bugfix] Fix arguments passed to `Sequence` in stop checker test (#5092)
DarkLight1337 May 29, 2024
b484450
[Core][Distributed] improve p2p access check (#4992)
youkaichao May 29, 2024
ca593c2
[Core] Cross-attention KV caching and memory-management (towards even…
afeldman-nm May 29, 2024
9fa589e
[Doc]Replace deprecated flag in readme (#4526)
ronensc May 29, 2024
d603c5d
[Bugfix][CI/Build] Fix test and improve code for `merge_async_iterato…
DarkLight1337 May 29, 2024
afa91c9
[Bugfix][CI/Build] Fix codespell failing to skip files in `git diff` …
DarkLight1337 May 29, 2024
e94d91b
[Core] Avoid the need to pass `None` values to `Sequence.inputs` (#5099)
DarkLight1337 May 29, 2024
cf7e434
[Bugfix] logprobs is not compatible with the OpenAI spec #4795 (#5031)
Etelis May 29, 2024
ba4c229
[Doc][Build] update after removing vllm-nccl (#5103)
youkaichao May 29, 2024
8ee205e
[Bugfix] gptq_marlin: Ensure g_idx_sort_indices is not a Parameter (#…
alexm-redhat May 30, 2024
0cd9ca4
[CI/Build] Docker cleanup functionality for amd servers (#5112)
okakarpa May 30, 2024
ea1db42
[BUGFIX] [FRONTEND] Correct chat logprobs (#5029)
br3no May 30, 2024
e1dc83e
[Bugfix] Automatically Detect SparseML models (#5119)
robertgshaw2-redhat May 30, 2024
51418e5
[CI/Build] increase wheel size limit to 200 MB (#5130)
youkaichao May 30, 2024
cbc6703
[Misc] remove duplicate definition of `seq_lens_tensor` in model_runn…
ita9naiwa May 30, 2024
5877363
[Doc] Use intersphinx and update entrypoints docs (#5125)
DarkLight1337 May 30, 2024
12eaba2
add doc about serving option on dstack (#3074)
deep-diver May 30, 2024
8847bc6
Bump version to v0.4.3 (#5046)
simon-mo May 30, 2024
81de9b1
[Build] Disable sm_90a in cu11 (#5141)
simon-mo May 30, 2024
b48cefe
[Bugfix] Avoid Warnings in SparseML Activation Quantization (#5120)
robertgshaw2-redhat May 31, 2024
adcf9cb
[Kernel] Marlin_24: Ensure the mma.sp instruction is using the ::orde…
alexm-redhat May 31, 2024
c320b5b
Fix cutlass sm_90a vesrion in CMakeList
simon-mo May 31, 2024
70a2e0a
[Model] Support MAP-NEO model (#5081)
xingweiqu May 31, 2024
027c4df
Revert "[Kernel] Marlin_24: Ensure the mma.sp instruction is using th…
simon-mo May 31, 2024
8f42cbe
[Misc]: optimize eager mode host time (#4196)
FuncSherl May 31, 2024
a7d0b3d
[Model] Enable FP8 QKV in MoE and refine kernel tuning script (#5039)
comaniac May 31, 2024
a46e8a9
[Doc] Add checkmark for GPTBigCodeForCausalLM LoRA support (#5171)
njhill Jun 1, 2024
626c93d
[Build] Guard against older CUDA versions when building CUTLASS 3.x k…
tlrmchlsmth Jun 1, 2024
f4ec244
Update vLLM to 1197e021
joerunde Jun 3, 2024
a17c8fb
Revert previous attempt at Triton patch; use CustomCacheManger approa…
tdoublep Jun 3, 2024
35de027
Add factories for logits_processors
maxdebayser Jun 3, 2024
f803ca2
Address review commments and simplify code
maxdebayser Jun 4, 2024
d5b47f5
Revert to original formatting
maxdebayser Jun 4, 2024
ef3e030
043 release fixes (#40)
joerunde Jun 4, 2024
ac902ef
[CI/Build] CMakeLists: build all extensions' cmake targets at the sam…
dtrifiro Jun 1, 2024
4f7c5a1
[Kernel] Refactor CUTLASS kernels to always take scales that reside o…
tlrmchlsmth Jun 1, 2024
c545c94
[Kernel] Update Cutlass fp8 configs (#5144)
varun-sundar-rabindranath Jun 1, 2024
18a4a37
[Minor] Fix the path typo in loader.py: save_sharded_states.py -> sav…
dashanji Jun 1, 2024
78beb36
[Bugfix] Fix call to init_logger in openai server (#4765)
NadavShmayo Jun 1, 2024
04af8d9
[Feature][Kernel] Support bitsandbytes quantization and QLoRA (#4776)
chenqianfzh Jun 1, 2024
fe27b98
[Bugfix] Remove deprecated @abstractproperty (#5174)
zhuohan123 Jun 1, 2024
939e0d4
[Bugfix]: Fix issues related to prefix caching example (#5177) (#5180)
Delviet Jun 1, 2024
3f21be2
[BugFix] Prevent `LLM.encode` for non-generation Models (#5184)
robertgshaw2-redhat Jun 1, 2024
e448589
Update test_ignore_eos (#4898)
simon-mo Jun 2, 2024
0748547
[Frontend][OpenAI] Support for returning max_model_len on /v1/models …
Avinash-Raj Jun 2, 2024
cec4364
[Kernel][ROCm][AMD] enable fused topk_softmax kernel for moe layer (#…
divakar-amd Jun 2, 2024
a53e398
[Misc] Simplify code and fix type annotations in `conftest.py` (#5118)
DarkLight1337 Jun 2, 2024
b1deaf3
[Core] Support image processor (#4197)
DarkLight1337 Jun 3, 2024
989c7b3
[Core] Remove unnecessary copies in flash attn backend (#5138)
Yard1 Jun 3, 2024
499ac4e
[Kernel] Pass a device pointer into the quantize kernel for the scale…
tlrmchlsmth Jun 3, 2024
bac28b3
[CI/BUILD] enable intel queue for longer CPU tests (#4113)
zhouyuan Jun 3, 2024
b0563b0
[Misc]: Implement CPU/GPU swapping in BlockManagerV2 (#3834)
Kaiyang-Chen Jun 3, 2024
b7de754
New CI template on AWS stack (#5110)
khluu Jun 3, 2024
aa19635
[FRONTEND] OpenAI `tools` support named functions (#5032)
br3no Jun 3, 2024
16804c0
[Bugfix] Support `prompt_logprobs==0` (#5217)
toslunar Jun 4, 2024
ee15107
[Bugfix] Add warmup for prefix caching example (#5235)
zhuohan123 Jun 4, 2024
d74f5fb
[Kernel] Enhance MoE benchmarking & tuning script (#4921)
WoosukKwon Jun 4, 2024
ad2c81c
[Bugfix]: During testing, use pytest monkeypatch for safely overridin…
afeldman-nm Jun 4, 2024
9fd018a
[Bugfix] Fix torch.compile() error when using MultiprocessingGPUExecu…
zifeitong Jun 4, 2024
0f78092
[CI/Build] Add inputs tests (#5215)
DarkLight1337 Jun 4, 2024
8dddd6b
[Bugfix] Fix a bug caused by pip install setuptools>=49.4.0 for CPU b…
DamonFool Jun 4, 2024
bdbb931
[Kernel] Add back batch size 1536 and 3072 to MoE tuning (#5242)
WoosukKwon Jun 4, 2024
72e195a
[CI/Build] Simplify model loading for `HfRunner` (#5251)
DarkLight1337 Jun 4, 2024
14dd5a1
[CI/Build] Reducing CPU CI execution time (#5241)
bigPYJ1151 Jun 4, 2024
f6af8d4
[CI] mark AMD test as softfail to prevent blockage (#5256)
simon-mo Jun 4, 2024
3e9a627
[Misc] Add transformers version to collect_env.py (#5259)
mgoin Jun 4, 2024
3b9b2bb
Merge branch 'main' into lp_factories
maxdebayser Jun 6, 2024
0be582d
address review comments
maxdebayser Jun 6, 2024
36 changes: 36 additions & 0 deletions .buildkite/check-wheel-size.py
@@ -0,0 +1,36 @@
import os
import zipfile

MAX_SIZE_MB = 200


def print_top_10_largest_files(zip_file):
    with zipfile.ZipFile(zip_file, 'r') as z:
        file_sizes = [(f, z.getinfo(f).file_size) for f in z.namelist()]
        file_sizes.sort(key=lambda x: x[1], reverse=True)
        for f, size in file_sizes[:10]:
            print(f"{f}: {size/(1024*1024)} MBs uncompressed.")


def check_wheel_size(directory):
    for root, _, files in os.walk(directory):
        for f in files:
            if f.endswith(".whl"):
                wheel_path = os.path.join(root, f)
                wheel_size = os.path.getsize(wheel_path)
                wheel_size_mb = wheel_size / (1024 * 1024)
                if wheel_size_mb > MAX_SIZE_MB:
                    print(
                        f"Wheel {wheel_path} is too large ({wheel_size_mb} MB) "
                        f"compared to the allowed size ({MAX_SIZE_MB} MB).")
                    print_top_10_largest_files(wheel_path)
                    return 1
                else:
                    print(f"Wheel {wheel_path} is within the allowed size "
                          f"({wheel_size_mb} MB).")
    return 0


if __name__ == "__main__":
    import sys
    sys.exit(check_wheel_size(sys.argv[1]))
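As a quick sanity check of the size-limit logic this script adds, here is a minimal self-contained sketch: it re-declares the same walk-and-check logic (without the printing) and exercises it against a tiny stand-in wheel built in a temp directory. The `demo-0.1-py3-none-any.whl` name and the simplified function body are illustrative, not part of the PR.

```python
import os
import tempfile
import zipfile

MAX_SIZE_MB = 200  # same limit as .buildkite/check-wheel-size.py


def check_wheel_size(directory):
    # Same walk-and-check logic as the CI script, re-declared here
    # so the sketch is self-contained; returns 1 on an oversized wheel.
    for root, _, files in os.walk(directory):
        for f in files:
            if f.endswith(".whl"):
                wheel_path = os.path.join(root, f)
                wheel_size_mb = os.path.getsize(wheel_path) / (1024 * 1024)
                if wheel_size_mb > MAX_SIZE_MB:
                    return 1
    return 0


with tempfile.TemporaryDirectory() as d:
    # A tiny stand-in wheel: a zip archive containing one small file.
    with zipfile.ZipFile(os.path.join(d, "demo-0.1-py3-none-any.whl"), "w") as z:
        z.writestr("demo/__init__.py", "print('hi')\n")
    print(check_wheel_size(d))  # prints 0: a tiny wheel is well under the limit
```

In CI the real script is invoked as `python3 .buildkite/check-wheel-size.py dist/`, with a non-zero exit code failing the step.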
18 changes: 18 additions & 0 deletions .buildkite/download-images.sh
@@ -0,0 +1,18 @@
#!/bin/bash

set -ex
set -o pipefail

(which wget && which curl) || (apt-get update && apt-get install -y wget curl)

# aws s3 sync s3://air-example-data-2/vllm_opensource_llava/ images/
mkdir -p images
cd images
wget https://air-example-data-2.s3.us-west-2.amazonaws.com/vllm_opensource_llava/stop_sign_pixel_values.pt
wget https://air-example-data-2.s3.us-west-2.amazonaws.com/vllm_opensource_llava/stop_sign_image_features.pt
wget https://air-example-data-2.s3.us-west-2.amazonaws.com/vllm_opensource_llava/cherry_blossom_pixel_values.pt
wget https://air-example-data-2.s3.us-west-2.amazonaws.com/vllm_opensource_llava/cherry_blossom_image_features.pt
wget https://air-example-data-2.s3.us-west-2.amazonaws.com/vllm_opensource_llava/stop_sign.jpg
wget https://air-example-data-2.s3.us-west-2.amazonaws.com/vllm_opensource_llava/cherry_blossom.jpg

cd -
93 changes: 64 additions & 29 deletions .buildkite/run-amd-test.sh
@@ -1,38 +1,73 @@
-# This script build the ROCm docker image and run the API server inside the container.
-# It serves a sanity check for compilation and basic model usage.
+# This script runs test inside the corresponding ROCm docker container.
 set -ex
 
 # Print ROCm version
+echo "--- ROCm info"
 rocminfo
 
-# Try building the docker image
-docker build -t rocm -f Dockerfile.rocm .
+# cleanup older docker images
+cleanup_docker() {
+  # Get Docker's root directory
+  docker_root=$(docker info -f '{{.DockerRootDir}}')
+  if [ -z "$docker_root" ]; then
+    echo "Failed to determine Docker root directory."
+    exit 1
+  fi
+  echo "Docker root directory: $docker_root"
+  # Check disk usage of the filesystem where Docker's root directory is located
+  disk_usage=$(df "$docker_root" | tail -1 | awk '{print $5}' | sed 's/%//')
+  # Define the threshold
+  threshold=70
+  if [ "$disk_usage" -gt "$threshold" ]; then
+    echo "Disk usage is above $threshold%. Cleaning up Docker images and volumes..."
+    # Remove dangling images (those that are not tagged and not used by any container)
+    docker image prune -f
+    # Remove unused volumes
+    docker volume prune -f
+    echo "Docker images and volumes cleanup completed."
+  else
+    echo "Disk usage is below $threshold%. No cleanup needed."
+  fi
+}
 
-# Setup cleanup
-remove_docker_container() { docker rm -f rocm || true; }
-trap remove_docker_container EXIT
-remove_docker_container
+# Call the cleanup docker function
+cleanup_docker
 
-# Run the image
-docker run --device /dev/kfd --device /dev/dri --network host --name rocm rocm python3 -m vllm.entrypoints.api_server &
+echo "--- Resetting GPUs"
 
-# Wait for the server to start
-wait_for_server_to_start() {
-  timeout=300
-  counter=0
+echo "reset" > /opt/amdgpu/etc/gpu_state
 
-  while [ "$(curl -s -o /dev/null -w ''%{http_code}'' localhost:8000/health)" != "200" ]; do
-    sleep 1
-    counter=$((counter + 1))
-    if [ $counter -ge $timeout ]; then
-      echo "Timeout after $timeout seconds"
-      break
-    fi
-  done
-}
-wait_for_server_to_start
+while true; do
+  sleep 3
+  if grep -q clean /opt/amdgpu/etc/gpu_state; then
+    echo "GPUs state is \"clean\""
+    break
+  fi
+done
 
-# Test a simple prompt
-curl -X POST -H "Content-Type: application/json" \
-  localhost:8000/generate \
-  -d '{"prompt": "San Francisco is a"}'
+echo "--- Building container"
+sha=$(git rev-parse --short HEAD)
+image_name=rocm_${sha}
+container_name=rocm_${sha}_$(tr -dc A-Za-z0-9 < /dev/urandom | head -c 10; echo)
+docker build \
+  -t ${image_name} \
+  -f Dockerfile.rocm \
+  --progress plain \
+  .
+
+remove_docker_container() {
+  docker rm -f ${container_name} || docker image rm -f ${image_name} || true
+}
+trap remove_docker_container EXIT
+
+echo "--- Running container"
+
+docker run \
+  --device /dev/kfd --device /dev/dri \
+  --network host \
+  --rm \
+  -e HF_TOKEN \
+  --name ${container_name} \
+  ${image_name} \
+  /bin/bash -c "${@}"
21 changes: 15 additions & 6 deletions .buildkite/run-benchmarks.sh
@@ -9,10 +9,10 @@ cd "$(dirname "${BASH_SOURCE[0]}")/.."
 (which wget && which curl) || (apt-get update && apt-get install -y wget curl)
 
 # run python-based benchmarks and upload the result to buildkite
-python3 benchmarks/benchmark_latency.py 2>&1 | tee benchmark_latency.txt
+python3 benchmarks/benchmark_latency.py --output-json latency_results.json 2>&1 | tee benchmark_latency.txt
 bench_latency_exit_code=$?
 
-python3 benchmarks/benchmark_throughput.py --input-len 256 --output-len 256 2>&1 | tee benchmark_throughput.txt
+python3 benchmarks/benchmark_throughput.py --input-len 256 --output-len 256 --output-json throughput_results.json 2>&1 | tee benchmark_throughput.txt
 bench_throughput_exit_code=$?
 
 # run server-based benchmarks and upload the result to buildkite
@@ -23,8 +23,9 @@ wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/r
 # wait for server to start, timeout after 600 seconds
 timeout 600 bash -c 'until curl localhost:8000/v1/models; do sleep 1; done' || exit 1
 python3 benchmarks/benchmark_serving.py \
-    --backend openai \
-    --dataset ./ShareGPT_V3_unfiltered_cleaned_split.json \
+    --backend vllm \
+    --dataset-name sharegpt \
+    --dataset-path ./ShareGPT_V3_unfiltered_cleaned_split.json \
     --model meta-llama/Llama-2-7b-chat-hf \
     --num-prompts 20 \
     --endpoint /v1/completions \
@@ -48,7 +49,14 @@ sed -n '$p' benchmark_throughput.txt >> benchmark_results.md # last line
 echo "### Serving Benchmarks" >> benchmark_results.md
 sed -n '1p' benchmark_serving.txt >> benchmark_results.md # first line
 echo "" >> benchmark_results.md
-tail -n 13 benchmark_serving.txt >> benchmark_results.md # last 13 lines
+echo '```' >> benchmark_results.md
+tail -n 20 benchmark_serving.txt >> benchmark_results.md # last 20 lines
+echo '```' >> benchmark_results.md
 
+# if the agent binary is not found, skip uploading the results, exit 0
+if [ ! -f /workspace/buildkite-agent ]; then
+    exit 0
+fi
+
 # upload the results to buildkite
 /workspace/buildkite-agent annotate --style "info" --context "benchmark-results" < benchmark_results.md
@@ -66,4 +74,5 @@ if [ $bench_serving_exit_code -ne 0 ]; then
     exit $bench_serving_exit_code
 fi
 
-/workspace/buildkite-agent artifact upload openai-*.json
+rm ShareGPT_V3_unfiltered_cleaned_split.json
+/workspace/buildkite-agent artifact upload "*.json"
24 changes: 24 additions & 0 deletions .buildkite/run-cpu-test.sh
@@ -0,0 +1,24 @@
# This script builds the CPU docker image and runs the offline inference inside the container.
# It serves as a sanity check for compilation and basic model usage.
set -ex

# Try building the docker image
docker build -t cpu-test -f Dockerfile.cpu .

# Setup cleanup
remove_docker_container() { docker rm -f cpu-test || true; }
trap remove_docker_container EXIT
remove_docker_container

# Run the image
docker run -itd -v ~/.cache/huggingface:/root/.cache/huggingface --cpuset-cpus=48-95 --cpuset-mems=1 --network host -e HF_TOKEN --env VLLM_CPU_KVCACHE_SPACE=4 --name cpu-test cpu-test

# offline inference
docker exec cpu-test bash -c "python3 examples/offline_inference.py"

# Run basic model test
docker exec cpu-test bash -c "cd tests;
  pip install pytest Pillow protobuf
  bash ../.buildkite/download-images.sh
  cd ../
  pytest -v -s tests/models --ignore=tests/models/test_llava.py --ignore=tests/models/test_embedding.py --ignore=tests/models/test_registry.py"
51 changes: 51 additions & 0 deletions .buildkite/run-neuron-test.sh
@@ -0,0 +1,51 @@
# This script builds the Neuron docker image and runs the API server inside the container.
# It serves as a sanity check for compilation and basic model usage.
set -e

# Try building the docker image
aws ecr get-login-password --region us-west-2 | docker login --username AWS --password-stdin 763104351884.dkr.ecr.us-west-2.amazonaws.com

# prune old image and containers to save disk space, and only once a day
# by using a timestamp file in tmp.
if [ -f /tmp/neuron-docker-build-timestamp ]; then
    last_build=$(cat /tmp/neuron-docker-build-timestamp)
    current_time=$(date +%s)
    if [ $((current_time - last_build)) -gt 86400 ]; then
        docker system prune -f
        echo $current_time > /tmp/neuron-docker-build-timestamp
    fi
else
    echo $(date +%s) > /tmp/neuron-docker-build-timestamp
fi

docker build -t neuron -f Dockerfile.neuron .

# Setup cleanup
remove_docker_container() { docker rm -f neuron || true; }
trap remove_docker_container EXIT
remove_docker_container

# Run the image
docker run --device=/dev/neuron0 --device=/dev/neuron1 --network host --name neuron neuron python3 -m vllm.entrypoints.api_server \
    --model TinyLlama/TinyLlama-1.1B-Chat-v1.0 --max-num-seqs 8 --max-model-len 128 --block-size 128 --device neuron --tensor-parallel-size 2 &

# Wait for the server to start
wait_for_server_to_start() {
    timeout=300
    counter=0

    while [ "$(curl -s -o /dev/null -w ''%{http_code}'' localhost:8000/health)" != "200" ]; do
        sleep 1
        counter=$((counter + 1))
        if [ $counter -ge $timeout ]; then
            echo "Timeout after $timeout seconds"
            break
        fi
    done
}
wait_for_server_to_start

# Test a simple prompt
curl -X POST -H "Content-Type: application/json" \
     localhost:8000/generate \
     -d '{"prompt": "San Francisco is a"}'
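The poll-until-HTTP-200 pattern that `wait_for_server_to_start` uses in these CI scripts can be sketched in Python; the function name, URL, and timeout values below are illustrative, not part of the PR.

```python
import time
import urllib.error
import urllib.request


def wait_for_server(url="http://localhost:8000/health",
                    timeout_s=300, interval_s=1.0):
    """Poll `url` until it returns HTTP 200 or `timeout_s` elapses.

    Mirrors the shell wait loop above, but returns True on success and
    False on timeout instead of just printing a message.
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            pass  # server not up yet; keep polling
        time.sleep(interval_s)
    return False
```

Returning a boolean (rather than `break`-ing out of the loop as the shell version does) lets a caller decide whether a timeout should fail the whole run.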