This repository was archived by the owner on Sep 4, 2025. It is now read-only.
Changes from all commits (145 commits)
473e7b3
[TPU] Fix TPU SMEM OOM by Pallas paged attention kernel (#9350)
WoosukKwon Oct 14, 2024
4d31cd4
[Frontend] merge beam search implementations (#9296)
LunrEclipse Oct 14, 2024
f0fe4fe
[Model] Make llama3.2 support multiple and interleaved images (#9095)
xiangxu-google Oct 14, 2024
169b530
[Bugfix] Clean up some cruft in mamba.py (#9343)
tlrmchlsmth Oct 15, 2024
44eaa5a
[Frontend] Clarify model_type error messages (#9345)
stevegrubb Oct 15, 2024
8e836d9
[Doc] Fix code formatting in spec_decode.rst (#9348)
mgoin Oct 15, 2024
55e081f
[Bugfix] Update InternVL input mapper to support image embeds (#9351)
hhzhang16 Oct 15, 2024
e9d517f
[BugFix] Fix chat API continuous usage stats (#9357)
njhill Oct 15, 2024
5d264f4
pass ignore_eos parameter to all benchmark_serving calls (#9349)
gracehonv Oct 15, 2024
22f8a69
[Misc] Directly use compressed-tensors for checkpoint definitions (#8…
mgoin Oct 15, 2024
ba30942
[Bugfix] Fix vLLM UsageInfo and logprobs None AssertionError with emp…
CatherineSue Oct 15, 2024
717a5f8
[Bugfix][CI/Build] Fix CUDA 11.8 Build (#9386)
LucasWilkinson Oct 16, 2024
ed92013
[Bugfix] Molmo text-only input bug fix (#9397)
mrsalehi Oct 16, 2024
7e7eae3
[Misc] Standardize RoPE handling for Qwen2-VL (#9250)
DarkLight1337 Oct 16, 2024
7abba39
[Model] VLM2Vec, the first multimodal embedding model in vLLM (#9303)
DarkLight1337 Oct 16, 2024
1de76a0
[CI/Build] Test VLM embeddings (#9406)
DarkLight1337 Oct 16, 2024
cee711f
[Core] Rename input data types (#8688)
DarkLight1337 Oct 16, 2024
59230ef
[Misc] Consolidate example usage of OpenAI client for multimodal mode…
ywang96 Oct 16, 2024
cf1d62a
[Model] Support SDPA attention for Molmo vision backbone (#9410)
Isotr0py Oct 16, 2024
415f76a
Support mistral interleaved attn (#9414)
patrickvonplaten Oct 16, 2024
fb60ae9
[Kernel][Model] Improve continuous batching for Jamba and Mamba (#9189)
mzusman Oct 16, 2024
5b8a1fd
[Model][Bugfix] Add FATReLU activation and support for openbmb/MiniCP…
0xjunhao Oct 16, 2024
8345045
[Performance][Spec Decode] Optimize ngram lookup performance (#9333)
LiuXiaoxuanPKU Oct 16, 2024
776dbd7
[CI/Build] mypy: Resolve some errors from checking vllm/engine (#9267)
russellb Oct 16, 2024
c3fab5f
[Bugfix][Kernel] Prevent integer overflow in fp8 dynamic per-token qu…
tlrmchlsmth Oct 16, 2024
92d86da
[BugFix] [Kernel] Fix GPU SEGV occurring in int8 kernels (#9391)
rasmith Oct 17, 2024
dbfa8d3
Add notes on the use of Slack (#9442)
terrytangyuan Oct 17, 2024
e312e52
[Kernel] Add Exllama as a backend for compressed-tensors (#9395)
LucasWilkinson Oct 17, 2024
390be74
[Misc] Print stack trace using `logger.exception` (#9461)
DarkLight1337 Oct 17, 2024
9d30a05
[misc] CUDA Time Layerwise Profiler (#8337)
LucasWilkinson Oct 17, 2024
5e443b5
[Bugfix] Allow prefill of assistant response when using `mistral_comm…
sasha0552 Oct 17, 2024
8e1cddc
[TPU] Call torch._sync(param) during weight loading (#9437)
WoosukKwon Oct 17, 2024
5eda21e
[Hardware][CPU] compressed-tensor INT8 W8A8 AZP support (#9344)
bigPYJ1151 Oct 17, 2024
81ede99
[Core] Deprecating block manager v1 and make block manager v2 default…
KuntaiDu Oct 17, 2024
a2c71c5
[CI/Build] remove .github from .dockerignore, add dirty repo check (#…
dtrifiro Oct 17, 2024
7871659
[Misc] Remove commit id file (#9470)
DarkLight1337 Oct 17, 2024
0f41fbe
[torch.compile] Fine-grained CustomOp enabling mechanism (#9300)
ProExpertProg Oct 17, 2024
eca2c5f
[Bugfix] Fix support for dimension like integers and ScalarType (#9299)
bnellnm Oct 17, 2024
d65049d
[Bugfix] Add random_seed to sample_hf_requests in benchmark_serving s…
wukaixingxp Oct 17, 2024
d615b5c
[Bugfix] Print warnings related to `mistral_common` tokenizer only on…
sasha0552 Oct 17, 2024
bb76538
[Hardware][Neuron] Simplify model load for transformers-neuronx libr…
sssrijan-amazon Oct 17, 2024
343f8e0
Support `BERTModel` (first `encoder-only` embedding model) (#9056)
robertgshaw2-redhat Oct 17, 2024
48138a8
[BugFix] Stop silent failures on compressed-tensors parsing (#9381)
dsikka Oct 18, 2024
de4008e
[Bugfix][Core] Use torch.cuda.memory_stats() to profile peak memory u…
joerunde Oct 18, 2024
154a8ae
[Qwen2.5] Support bnb quant for Qwen2.5 (#9467)
blueyo0 Oct 18, 2024
944dd8e
[CI/Build] Use commit hash references for github actions (#9430)
russellb Oct 18, 2024
1ffc8a7
[BugFix] Typing fixes to RequestOutput.prompt and beam search (#9473)
njhill Oct 18, 2024
d2b1bf5
[Frontend][Feature] Add jamba tool parser (#9154)
tomeras91 Oct 18, 2024
25aeb7d
[BugFix] Fix and simplify completion API usage streaming (#9475)
njhill Oct 18, 2024
1bbbcc0
[CI/Build] Fix lint errors in mistral tokenizer (#9504)
DarkLight1337 Oct 18, 2024
ae8b633
[Bugfix] Fix offline_inference_with_prefix.py (#9505)
tlrmchlsmth Oct 18, 2024
7dbe738
[Misc] benchmark: Add option to set max concurrency (#9390)
russellb Oct 18, 2024
051eaf6
[Model] Add user-configurable task for models that support both gener…
DarkLight1337 Oct 18, 2024
67a7e5e
[CI/Build] Add error matching config for mypy (#9512)
russellb Oct 18, 2024
3921a2f
[Model] Support Pixtral models in the HF Transformers format (#9036)
mgoin Oct 18, 2024
9bb10a7
[MISC] Add lora requests to metrics (#9477)
coolkp Oct 18, 2024
d11bf43
[MISC] Consolidate cleanup() and refactor offline_inference_with_pref…
comaniac Oct 18, 2024
0c9a525
[Kernel] Add env variable to force flashinfer backend to enable tenso…
tdoublep Oct 19, 2024
337ed76
[Bugfix] Fix offline mode when using `mistral_common` (#9457)
sasha0552 Oct 19, 2024
380e186
:bug: fix torch memory profiling (#9516)
joerunde Oct 19, 2024
1325872
[Frontend] Avoid creating guided decoding LogitsProcessor unnecessari…
njhill Oct 19, 2024
82c2515
[Doc] update gpu-memory-utilization flag docs (#9507)
joerunde Oct 19, 2024
dfd951e
[CI/Build] Add error matching for ruff output (#9513)
russellb Oct 19, 2024
85dc92f
[CI/Build] Configure matcher for actionlint workflow (#9511)
russellb Oct 19, 2024
c5eea3c
[Frontend] Support simpler image input format (#9478)
yue-anyscale Oct 19, 2024
263d8ee
[Bugfix] Fix missing task for speculative decoding (#9524)
DarkLight1337 Oct 19, 2024
8e3e7f2
[Model][Pixtral] Optimizations for input_processor_for_pixtral_hf (#9…
mgoin Oct 19, 2024
5b59fe0
[Bugfix] Pass json-schema to GuidedDecodingParams and make test stron…
heheda12345 Oct 20, 2024
962d2c6
[Model][Pixtral] Use memory_efficient_attention for PixtralHFVision (…
mgoin Oct 20, 2024
4fa3e33
[Kernel] Support sliding window in flash attention backend (#9403)
heheda12345 Oct 20, 2024
855e0e6
[Frontend][Misc] Goodput metric support (#9338)
Imss27 Oct 20, 2024
696b01a
[CI/Build] Split up decoder-only LM tests (#9488)
DarkLight1337 Oct 21, 2024
496e991
[Doc] Consistent naming of attention backends (#9498)
tdoublep Oct 21, 2024
f6b9729
[Model] FalconMamba Support (#9325)
dhiaEddineRhaiem Oct 21, 2024
8ca8954
[Bugfix][Misc]: fix graph capture for decoder (#9549)
yudian0504 Oct 21, 2024
ec6bd6c
[BugFix] Use correct python3 binary in Docker.ppc64le entrypoint (#9492)
varad-ahirwadkar Oct 21, 2024
5241aa1
[Model][Bugfix] Fix batching with multi-image in PixtralHF (#9518)
mgoin Oct 21, 2024
9d9186b
[Frontend] Reduce frequency of client cancellation checking (#7959)
njhill Oct 21, 2024
d621c43
[doc] fix format (#9562)
youkaichao Oct 21, 2024
15713e3
[BugFix] Update draft model TP size check to allow matching target TP…
njhill Oct 21, 2024
711f3a7
[Frontend] Don't log duplicate error stacktrace for every request in …
wallashss Oct 21, 2024
575dceb
[CI] Make format checker error message more user-friendly by using em…
KuntaiDu Oct 21, 2024
ef7faad
:bug: Fixup more test failures from memory profiling (#9563)
joerunde Oct 22, 2024
76a5e13
[core] move parallel sampling out from vllm core (#9302)
youkaichao Oct 22, 2024
b729901
[Bugfix]: serialize config by value for --trust-remote-code (#6751)
tjohnson31415 Oct 22, 2024
f085995
[CI/Build] Remove unnecessary `fork_new_process` (#9484)
DarkLight1337 Oct 22, 2024
29acd2c
[Bugfix][OpenVINO] fix_dockerfile_openvino (#9552)
ngrozae Oct 22, 2024
7469242
[Bugfix]: phi.py get rope_theta from config file (#9503)
Falko1 Oct 22, 2024
c029221
[CI/Build] Replaced some models on tests for smaller ones (#9570)
wallashss Oct 22, 2024
ca30c3c
[Core] Remove evictor_v1 (#9572)
KuntaiDu Oct 22, 2024
f7db5f0
[Doc] Use shell code-blocks and fix section headers (#9508)
rafvasq Oct 22, 2024
0d02747
support TP in qwen2 bnb (#9574)
chenqianfzh Oct 22, 2024
3ddbe25
[Hardware][CPU] using current_platform.is_cpu (#9536)
wangshuai09 Oct 22, 2024
6c5af09
[V1] Implement vLLM V1 [1/N] (#9289)
WoosukKwon Oct 22, 2024
a48e3ec
[CI/Build][LoRA] Temporarily fix long context failure issue (#9579)
jeejeelee Oct 22, 2024
9dbcce8
[Neuron] [Bugfix] Fix neuron startup (#9374)
xendo Oct 22, 2024
bb392ea
[Model][VLM] Initialize support for Mono-InternVL model (#9528)
Isotr0py Oct 22, 2024
08075c3
[Bugfix] Eagle: change config name for fc bias (#9580)
gopalsarda Oct 22, 2024
32a1ee7
[Hardware][Intel CPU][DOC] Update docs for CPU backend (#6212)
zhouyuan Oct 22, 2024
434984e
[Frontend] Support custom request_id from request (#9550)
guoyuhong Oct 22, 2024
cd5601a
[BugFix] Prevent exporting duplicate OpenTelemetry spans (#9017)
ronensc Oct 22, 2024
17c79f3
[torch.compile] auto infer dynamic_arg_dims from type annotation (#9589)
youkaichao Oct 22, 2024
23b899a
[Bugfix] fix detokenizer shallow copy (#5919)
aurickq Oct 22, 2024
cb6fdaa
[Misc] Make benchmarks use EngineArgs (#9529)
JArnoldAMD Oct 22, 2024
d1e8240
[Bugfix] Fix spurious "No compiled cutlass_scaled_mm ..." for W8A8 on…
LucasWilkinson Oct 22, 2024
b17046e
[BugFix] Fix metrics error for --num-scheduler-steps > 1 (#8234)
yuleil Oct 22, 2024
208cb34
[Doc]: Update tensorizer docs to include vllm[tensorizer] (#7889)
sethkimmel3 Oct 22, 2024
65050a4
[Bugfix] Generate exactly input_len tokens in benchmark_throughput (#…
heheda12345 Oct 23, 2024
29061ed
[Misc] Add an env var VLLM_LOGGING_PREFIX, if set, it will be prepend…
sfc-gh-zhwang Oct 23, 2024
831540c
[Model] Support E5-V (#9576)
DarkLight1337 Oct 23, 2024
51c24c9
[Build] Fix `FetchContent` multiple build issue (#9596)
ProExpertProg Oct 23, 2024
2394962
[Hardware][XPU] using current_platform.is_xpu (#9605)
MengqingCao Oct 23, 2024
3ff57eb
[Model] Initialize Florence-2 language backbone support (#9555)
Isotr0py Oct 23, 2024
c18e1a3
[VLM] Enable overriding whether post layernorm is used in vision enco…
DarkLight1337 Oct 23, 2024
31a08f5
[Model] Add min_pixels / max_pixels to Qwen2VL as mm_processor_kwargs…
alex-jw-brooks Oct 23, 2024
e7116c0
[Bugfix] Fix `_init_vision_model` in NVLM_D model (#9611)
DarkLight1337 Oct 23, 2024
dbdd3b5
[misc] comment to avoid future confusion about baichuan (#9620)
youkaichao Oct 23, 2024
e5ac6a4
[Bugfix] Fix divide by zero when serving Mamba models (#9617)
tlrmchlsmth Oct 23, 2024
fd0e2cf
[Misc] Separate total and output tokens in benchmark_throughput.py (#…
mgoin Oct 23, 2024
9013e24
[torch.compile] Adding torch compile annotations to some models (#9614)
CRZbulabula Oct 23, 2024
150b779
[Frontend] Enable Online Multi-image Support for MLlama (#9393)
alex-jw-brooks Oct 23, 2024
fc6c274
[Model] Add Qwen2-Audio model support (#9248)
faychu Oct 23, 2024
b548d7a
[CI/Build] Add bot to close stale issues and PRs (#9436)
russellb Oct 23, 2024
bb01f29
[Bugfix][Model] Fix Mllama SDPA illegal memory access for batched mul…
mgoin Oct 24, 2024
b7df53c
[Bugfix] Use "vision_model" prefix for MllamaVisionModel (#9628)
mgoin Oct 24, 2024
33bab41
[Bugfix]: Make chat content text allow type content (#9358)
vrdn-23 Oct 24, 2024
056a68c
[XPU] avoid triton import for xpu (#9440)
yma11 Oct 24, 2024
836e8ef
[Bugfix] Fix PP for ChatGLM and Molmo (#9422)
DarkLight1337 Oct 24, 2024
3770071
[V1][Bugfix] Clean up requests when aborted (#9629)
WoosukKwon Oct 24, 2024
4fdc581
[core] simplify seq group code (#9569)
youkaichao Oct 24, 2024
8a02cd0
[torch.compile] Adding torch compile annotations to some models (#9639)
CRZbulabula Oct 24, 2024
295a061
[Kernel] add kernel for FATReLU (#9610)
jeejeelee Oct 24, 2024
ad6f780
[torch.compile] expanding support and fix allgather compilation (#9637)
CRZbulabula Oct 24, 2024
b979143
[Doc] Move additional tips/notes to the top (#9647)
DarkLight1337 Oct 24, 2024
f584549
[Bugfix]Disable the post_norm layer of the vision encoder for LLaVA m…
litianjian Oct 24, 2024
de662d3
Increase operation per run limit for "Close inactive issues and PRs" …
hmellor Oct 24, 2024
d27cfbf
[torch.compile] Adding torch compile annotations to some models (#9641)
CRZbulabula Oct 24, 2024
a92b917
Merge branch 'main' of https://github.com/vllm-project/vllm into ibm-…
fialhocoelho Oct 24, 2024
80f94ec
Squash 5733
fialhocoelho Oct 24, 2024
9f5b55a
Squash 6357
fialhocoelho Oct 24, 2024
ce5eebb
Squash 9027
fialhocoelho Oct 24, 2024
8675a9e
Squash 9522
fialhocoelho Oct 24, 2024
e949f65
Squash 9625
fialhocoelho Oct 24, 2024
7a6d518
Squash 9631
fialhocoelho Oct 24, 2024
080973c
using adapter version merged with PR #172
fialhocoelho Oct 24, 2024
11 changes: 11 additions & 0 deletions .buildkite/lm-eval-harness/configs/Meta-Llama-3.2-1B-Instruct-INT8-compressed-tensors.yaml
@@ -0,0 +1,11 @@
# bash .buildkite/lm-eval-harness/run-lm-eval-gsm-vllm-baseline.sh -m neuralmagic/Llama-3.2-1B-Instruct-quantized.w8a8 -b "auto" -l 1000 -f 5 -t 1
model_name: "neuralmagic/Llama-3.2-1B-Instruct-quantized.w8a8"
tasks:
- name: "gsm8k"
metrics:
- name: "exact_match,strict-match"
value: 0.356
- name: "exact_match,flexible-extract"
value: 0.358
limit: 1000
num_fewshot: 5
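
For reference, the flags in the header comment line up with the config fields: a plausible reading (not confirmed by this diff) is that -l maps to "limit", -f to "num_fewshot", -b to batch size, and -t to tensor-parallel size.

# Annotated form of the invocation from the header comment above.
# -l 1000 -> limit: 1000 and -f 5 -> num_fewshot: 5 match the YAML fields;
# treating -b as batch size and -t as tensor-parallel size is an assumption.
bash .buildkite/lm-eval-harness/run-lm-eval-gsm-vllm-baseline.sh \
  -m neuralmagic/Llama-3.2-1B-Instruct-quantized.w8a8 \
  -b "auto" -l 1000 -f 5 -t 1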
2 changes: 1 addition & 1 deletion .buildkite/lm-eval-harness/configs/models-small.txt
@@ -1,6 +1,6 @@
Meta-Llama-3-8B-Instruct.yaml
Meta-Llama-3-8B-Instruct-FP8-compressed-tensors.yaml
Meta-Llama-3-8B-Instruct-INT8-compressed-tensors.yaml
Meta-Llama-3.2-1B-Instruct-INT8-compressed-tensors.yaml
Meta-Llama-3-8B-Instruct-INT8-compressed-tensors-asym.yaml
Meta-Llama-3-8B-Instruct-nonuniform-compressed-tensors.yaml
Meta-Llama-3-8B-Instruct-Channelwise-compressed-tensors.yaml
4 changes: 2 additions & 2 deletions .buildkite/release-pipeline.yaml
@@ -3,7 +3,7 @@ steps:
agents:
queue: cpu_queue
commands:
- "DOCKER_BUILDKIT=1 docker build --build-arg max_jobs=16 --build-arg USE_SCCACHE=1 --build-arg CUDA_VERSION=12.1.0 --tag vllm-ci:build-image --target build --progress plain ."
- "DOCKER_BUILDKIT=1 docker build --build-arg max_jobs=16 --build-arg USE_SCCACHE=1 --build-arg GIT_REPO_CHECK=1 --build-arg CUDA_VERSION=12.1.0 --tag vllm-ci:build-image --target build --progress plain ."
- "mkdir artifacts"
- "docker run --rm -v $(pwd)/artifacts:/artifacts_host vllm-ci:build-image bash -c 'cp -r dist /artifacts_host && chmod -R a+rw /artifacts_host'"
# rename the files to change linux -> manylinux1
@@ -22,7 +22,7 @@ steps:
agents:
queue: cpu_queue
commands:
- "DOCKER_BUILDKIT=1 docker build --build-arg max_jobs=16 --build-arg USE_SCCACHE=1 --build-arg CUDA_VERSION=11.8.0 --tag vllm-ci:build-image --target build --progress plain ."
- "DOCKER_BUILDKIT=1 docker build --build-arg max_jobs=16 --build-arg USE_SCCACHE=1 --build-arg GIT_REPO_CHECK=1 --build-arg CUDA_VERSION=11.8.0 --tag vllm-ci:build-image --target build --progress plain ."
- "mkdir artifacts"
- "docker run --rm -v $(pwd)/artifacts:/artifacts_host vllm-ci:build-image bash -c 'cp -r dist /artifacts_host && chmod -R a+rw /artifacts_host'"
# rename the files to change linux -> manylinux1
8 changes: 4 additions & 4 deletions .buildkite/run-cpu-test.sh
@@ -32,10 +32,10 @@ docker exec cpu-test bash -c "
--ignore=tests/models/decoder_only/language/test_danube3_4b.py" # Mamba and Danube3-4B on CPU is not supported

# Run compressed-tensor test
# docker exec cpu-test bash -c "
# pytest -s -v \
# tests/quantization/test_compressed_tensors.py::test_compressed_tensors_w8a8_static_setup \
# tests/quantization/test_compressed_tensors.py::test_compressed_tensors_w8a8_dynanmic_per_token"
docker exec cpu-test bash -c "
pytest -s -v \
tests/quantization/test_compressed_tensors.py::test_compressed_tensors_w8a8_static_setup \
tests/quantization/test_compressed_tensors.py::test_compressed_tensors_w8a8_dynamic_per_token"

# Run AWQ test
docker exec cpu-test bash -c "
48 changes: 26 additions & 22 deletions .buildkite/test-pipeline.yaml
@@ -77,8 +77,8 @@ steps:
- vllm/
- tests/basic_correctness/test_chunked_prefill
commands:
- VLLM_ATTENTION_BACKEND=XFORMERS VLLM_ALLOW_DEPRECATED_BLOCK_MANAGER_V1=1 pytest -v -s basic_correctness/test_chunked_prefill.py
- VLLM_ATTENTION_BACKEND=FLASH_ATTN VLLM_ALLOW_DEPRECATED_BLOCK_MANAGER_V1=1 pytest -v -s basic_correctness/test_chunked_prefill.py
- VLLM_ATTENTION_BACKEND=XFORMERS pytest -v -s basic_correctness/test_chunked_prefill.py
- VLLM_ATTENTION_BACKEND=FLASH_ATTN pytest -v -s basic_correctness/test_chunked_prefill.py

- label: Core Test # 10min
mirror_hardwares: [amd]
@@ -88,11 +88,7 @@ steps:
- vllm/distributed
- tests/core
commands:
- VLLM_ALLOW_DEPRECATED_BLOCK_MANAGER_V1=1 pytest -v -s core/test_scheduler.py
- VLLM_ALLOW_DEPRECATED_BLOCK_MANAGER_V1=1 pytest -v -s core core/test_chunked_prefill_scheduler.py
- VLLM_ALLOW_DEPRECATED_BLOCK_MANAGER_V1=1 pytest -v -s core core/block/e2e/test_correctness.py
- VLLM_ALLOW_DEPRECATED_BLOCK_MANAGER_V1=1 pytest -v -s core core/block/e2e/test_correctness_sliding_window.py
- pytest -v -s core --ignore=core/block/e2e/test_correctness.py --ignore=core/test_scheduler.py --ignore=core/test_chunked_prefill_scheduler.py --ignore=core/block/e2e/test_correctness.py --ignore=core/block/e2e/test_correctness_sliding_window.py
- pytest -v -s core

- label: Entrypoints Test # 40min
working_dir: "/vllm-workspace/tests"
@@ -184,15 +180,15 @@ steps:
- python3 offline_inference_vision_language_multi_image.py
- python3 tensorize_vllm_model.py --model facebook/opt-125m serialize --serialized-directory /tmp/ --suffix v1 && python3 tensorize_vllm_model.py --model facebook/opt-125m deserialize --path-to-tensors /tmp/vllm/facebook/opt-125m/v1/model.tensors
- python3 offline_inference_encoder_decoder.py
- python3 offline_profile.py --model facebook/opt-125m

- label: Prefix Caching Test # 9min
#mirror_hardwares: [amd]
source_file_dependencies:
- vllm/
- tests/prefix_caching
commands:
- VLLM_ALLOW_DEPRECATED_BLOCK_MANAGER_V1=1 pytest -v -s prefix_caching/test_prefix_caching.py
- pytest -v -s prefix_caching --ignore=prefix_caching/test_prefix_caching.py
- pytest -v -s prefix_caching

- label: Samplers Test # 36min
source_file_dependencies:
@@ -216,8 +212,7 @@
- tests/spec_decode
commands:
- pytest -v -s spec_decode/e2e/test_multistep_correctness.py
- VLLM_ALLOW_DEPRECATED_BLOCK_MANAGER_V1=1 pytest -v -s spec_decode/e2e/test_compatibility.py
- VLLM_ATTENTION_BACKEND=FLASH_ATTN pytest -v -s spec_decode --ignore=spec_decode/e2e/test_multistep_correctness.py --ignore=spec_decode/e2e/test_compatibility.py
- VLLM_ATTENTION_BACKEND=FLASH_ATTN pytest -v -s spec_decode --ignore=spec_decode/e2e/test_multistep_correctness.py

- label: LoRA Test %N # 15min each
mirror_hardwares: [amd]
@@ -235,14 +230,12 @@
commands:
- pytest -v -s compile/test_basic_correctness.py

# TODO: re-write in comparison tests, and fix symbolic shape
# for quantization ops.
# - label: "PyTorch Fullgraph Test" # 18min
# source_file_dependencies:
# - vllm/
# - tests/compile
# commands:
# - pytest -v -s compile/test_full_graph.py
- label: "PyTorch Fullgraph Test" # 18min
source_file_dependencies:
- vllm/
- tests/compile
commands:
- pytest -v -s compile/test_full_graph.py

- label: Kernels Test %N # 1h each
mirror_hardwares: [amd]
@@ -317,13 +310,22 @@
- pytest -v -s models/test_oot_registration.py # it needs a clean process
- pytest -v -s models/*.py --ignore=models/test_oot_registration.py

- label: Decoder-only Language Models Test # 1h36min
- label: Decoder-only Language Models Test (Standard) # 35min
#mirror_hardwares: [amd]
source_file_dependencies:
- vllm/
- tests/models/decoder_only/language
commands:
- pytest -v -s models/decoder_only/language
- pytest -v -s models/decoder_only/language/test_models.py
- pytest -v -s models/decoder_only/language/test_big_models.py

- label: Decoder-only Language Models Test (Extended) # 1h20min
nightly: true
source_file_dependencies:
- vllm/
- tests/models/decoder_only/language
commands:
- pytest -v -s models/decoder_only/language --ignore=models/decoder_only/language/test_models.py --ignore=models/decoder_only/language/test_big_models.py

- label: Decoder-only Multi-Modal Models Test # 1h31min
#mirror_hardwares: [amd]
@@ -340,10 +342,12 @@
source_file_dependencies:
- vllm/
- tests/models/embedding/language
- tests/models/embedding/vision_language
- tests/models/encoder_decoder/language
- tests/models/encoder_decoder/vision_language
commands:
- pytest -v -s models/embedding/language
- pytest -v -s models/embedding/vision_language
- pytest -v -s models/encoder_decoder/language
- pytest -v -s models/encoder_decoder/vision_language

@@ -402,7 +406,7 @@ steps:
- pytest -v -s ./compile/test_basic_correctness.py
- pytest -v -s ./compile/test_wrapper.py
- VLLM_TEST_SAME_HOST=1 torchrun --nproc-per-node=4 distributed/test_same_node.py | grep -q 'Same node test passed'
- TARGET_TEST_SUITE=L4 VLLM_ALLOW_DEPRECATED_BLOCK_MANAGER_V1=1 pytest basic_correctness/ -v -s -m distributed_2_gpus
- TARGET_TEST_SUITE=L4 pytest basic_correctness/ -v -s -m distributed_2_gpus
# Avoid importing model tests that cause CUDA reinitialization error
- pytest models/encoder_decoder/language/test_bart.py -v -s -m distributed_2_gpus
- pytest models/encoder_decoder/vision_language/test_broadcast.py -v -s -m distributed_2_gpus
1 change: 0 additions & 1 deletion .dockerignore
@@ -1,4 +1,3 @@
/.github/
/.venv
/build
dist
1 change: 1 addition & 0 deletions .github/workflows/actionlint.yml
@@ -34,4 +34,5 @@ jobs:

- name: "Run actionlint"
run: |
echo "::add-matcher::.github/workflows/matchers/actionlint.json"
tools/actionlint.sh -color
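
For context, "::add-matcher::" is a GitHub Actions workflow command: once a step prints it, the runner loads the referenced JSON problem matcher and applies its regexes to subsequent log lines, surfacing matches as inline file/line annotations on the PR. A minimal sketch of the pattern:

# Register the matcher, then run the tool; matching log lines become annotations.
echo "::add-matcher::.github/workflows/matchers/actionlint.json"
tools/actionlint.sh -color
# A matcher can be detached again by its owner id (assuming the JSON file
# declares "owner": "actionlint"):
echo "::remove-matcher owner=actionlint::"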
21 changes: 21 additions & 0 deletions .github/workflows/add_label_automerge.yml
@@ -0,0 +1,21 @@
name: Add label on auto-merge enabled
on:
pull_request_target:
types:
- auto_merge_enabled
jobs:
add-label-on-auto-merge:
runs-on: ubuntu-latest
steps:
- name: Add label
uses: actions/github-script@60a0d83039c74a4aee543508d2ffcb1c3799cdea # v7.0.1
with:
script: |
github.rest.issues.addLabels({
owner: context.repo.owner,
repo: context.repo.repo,
issue_number: context.issue.number,
labels: ['ready']
})
env:
GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
6 changes: 3 additions & 3 deletions .github/workflows/clang-format.yml
@@ -17,9 +17,9 @@ jobs:
matrix:
python-version: ["3.11"]
steps:
- uses: actions/checkout@v4
- uses: actions/checkout@eef61447b9ff4aafe5dcd4e0bbf5d482be7e7871 # v4.2.1
- name: Set up Python ${{ matrix.python-version }}
uses: actions/setup-python@v5
uses: actions/setup-python@f677139bbe7f9c59b41e40162b753c062f5d49a3 # v5.2.0
with:
python-version: ${{ matrix.python-version }}
- name: Install dependencies
@@ -38,4 +38,4 @@
)
find csrc/ \( -name '*.h' -o -name '*.cpp' -o -name '*.cu' -o -name '*.cuh' \) -print \
| grep -vFf <(printf "%s\n" "${EXCLUDES[@]}") \
| xargs clang-format --dry-run --Werror
| xargs clang-format --dry-run --Werror
16 changes: 16 additions & 0 deletions .github/workflows/matchers/mypy.json
@@ -0,0 +1,16 @@
{
"problemMatcher": [
{
"owner": "mypy",
"pattern": [
{
"regexp": "^(.+):(\\d+):\\s(error|warning):\\s(.+)$",
"file": 1,
"line": 2,
"severity": 3,
"message": 4
}
]
}
]
}
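
To make the capture groups concrete, here is a mypy-style diagnostic of the shape this regex expects (the path and message are hypothetical examples, not output from this PR):

# Sample line: vllm/engine/llm_engine.py:123: error: Incompatible return value type
# Matcher groups: file = vllm/engine/llm_engine.py, line = 123,
# severity = error, message = Incompatible return value type.
# Checking the shape with an equivalent POSIX ERE (illustrative only):
echo "vllm/engine/llm_engine.py:123: error: Incompatible return value type" \
  | grep -E '^(.+):([0-9]+):[[:space:]](error|warning):[[:space:]](.+)$'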
17 changes: 17 additions & 0 deletions .github/workflows/matchers/ruff.json
@@ -0,0 +1,17 @@
{
"problemMatcher": [
{
"owner": "ruff",
"pattern": [
{
"regexp": "^(.+?):(\\d+):(\\d+): (\\w+): (.+)$",
"file": 1,
"line": 2,
"column": 3,
"code": 4,
"message": 5
}
]
}
]
}
7 changes: 4 additions & 3 deletions .github/workflows/mypy.yaml
@@ -17,9 +17,9 @@ jobs:
matrix:
python-version: ["3.8", "3.9", "3.10", "3.11", "3.12"]
steps:
- uses: actions/checkout@v4
- uses: actions/checkout@eef61447b9ff4aafe5dcd4e0bbf5d482be7e7871 # v4.2.1
- name: Set up Python ${{ matrix.python-version }}
uses: actions/setup-python@v5
uses: actions/setup-python@f677139bbe7f9c59b41e40162b753c062f5d49a3 # v5.2.0
with:
python-version: ${{ matrix.python-version }}
- name: Install dependencies
@@ -32,4 +32,5 @@
pip install types-setuptools
- name: Mypy
run: |
tools/mypy.sh
echo "::add-matcher::.github/workflows/matchers/mypy.json"
tools/mypy.sh 1
12 changes: 6 additions & 6 deletions .github/workflows/publish.yml
@@ -21,7 +21,7 @@ jobs:
upload_url: ${{ steps.create_release.outputs.upload_url }}
steps:
- name: Checkout
uses: actions/checkout@v4
uses: actions/checkout@eef61447b9ff4aafe5dcd4e0bbf5d482be7e7871 # v4.2.1

- name: Extract branch info
shell: bash
@@ -30,7 +30,7 @@

- name: Create Release
id: create_release
uses: "actions/github-script@v7"
uses: actions/github-script@60a0d83039c74a4aee543508d2ffcb1c3799cdea # v7.0.1
env:
RELEASE_TAG: ${{ env.release_tag }}
with:
@@ -54,10 +54,10 @@

steps:
- name: Checkout
uses: actions/checkout@v4
uses: actions/checkout@eef61447b9ff4aafe5dcd4e0bbf5d482be7e7871 # v4.2.1

- name: Setup ccache
uses: hendrikmuhs/ccache-action@v1.2
uses: hendrikmuhs/ccache-action@ed74d11c0b343532753ecead8a951bb09bb34bc9 # v1.2.14
with:
create-symlink: true
key: ${{ github.job }}-${{ matrix.python-version }}-${{ matrix.cuda-version }}
@@ -68,7 +68,7 @@
bash -x .github/workflows/scripts/env.sh

- name: Set up Python
uses: actions/setup-python@v5
uses: actions/setup-python@f677139bbe7f9c59b41e40162b753c062f5d49a3 # v5.2.0
with:
python-version: ${{ matrix.python-version }}

@@ -92,7 +92,7 @@
echo "asset_name=${asset_name}" >> "$GITHUB_ENV"

- name: Upload Release Asset
uses: actions/upload-release-asset@v1
uses: actions/upload-release-asset@e8f9f06c4b078e705bd2ea027f0926603fc9b4d5 # v1.0.2
env:
GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
with:
21 changes: 21 additions & 0 deletions .github/workflows/reminder_comment.yml
@@ -0,0 +1,21 @@
name: PR Reminder Comment Bot
on:
pull_request_target:
types: [opened]

jobs:
pr_reminder:
runs-on: ubuntu-latest
steps:
- name: Remind to run full CI on PR
uses: actions/github-script@60a0d83039c74a4aee543508d2ffcb1c3799cdea # v7.0.1
with:
script: |
github.rest.issues.createComment({
owner: context.repo.owner,
repo: context.repo.repo,
issue_number: context.issue.number,
body: '👋 Hi! Thank you for contributing to the vLLM project.\n Just a reminder: PRs would not trigger full CI run by default. Instead, it would only run `fastcheck` CI which starts running only a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your `fastcheck` build on Buildkite UI (linked in the PR checks section) and unblock them. If you do not have permission to unblock, ping `simon-mo` or `khluu` to add you in our Buildkite org. \n\nOnce the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.\n\n To run CI, PR reviewers can do one of these:\n- Add `ready` label to the PR\n- Enable auto-merge.\n\n🚀'
})
env:
GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
7 changes: 4 additions & 3 deletions .github/workflows/ruff.yml
@@ -17,9 +17,9 @@ jobs:
matrix:
python-version: ["3.8", "3.9", "3.10", "3.11", "3.12"]
steps:
- uses: actions/checkout@v4
- uses: actions/checkout@eef61447b9ff4aafe5dcd4e0bbf5d482be7e7871 # v4.2.1
- name: Set up Python ${{ matrix.python-version }}
uses: actions/setup-python@v5
uses: actions/setup-python@f677139bbe7f9c59b41e40162b753c062f5d49a3 # v5.2.0
with:
python-version: ${{ matrix.python-version }}
- name: Install dependencies
@@ -28,7 +28,8 @@
pip install -r requirements-lint.txt
- name: Analysing the code with ruff
run: |
ruff check .
echo "::add-matcher::.github/workflows/matchers/ruff.json"
ruff check --output-format github .
- name: Spelling check with codespell
run: |
codespell --toml pyproject.toml
4 changes: 4 additions & 0 deletions .github/workflows/scripts/build.sh
@@ -1,4 +1,5 @@
#!/bin/bash
set -eux

python_executable=python$1
cuda_home=/usr/local/cuda-$2
@@ -15,5 +16,8 @@ export MAX_JOBS=1
# Make sure release wheels are built for the following architectures
export TORCH_CUDA_ARCH_LIST="7.0 7.5 8.0 8.6 8.9 9.0+PTX"
export VLLM_FA_CMAKE_GPU_ARCHES="80-real;90-real"

bash tools/check_repo.sh

# Build
$python_executable setup.py bdist_wheel --dist-dir=dist
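
The build script now runs tools/check_repo.sh before building. Together with the GIT_REPO_CHECK build arg added in the release pipeline, this suggests a guard against building wheels from a dirty tree. The script body is not shown in this diff; a minimal sketch under that assumption:

#!/bin/bash
# Hypothetical sketch of a dirty-repo guard; the actual tools/check_repo.sh
# is not included in this PR and may differ.
if [ -n "$(git status --porcelain)" ]; then
  echo "Source tree has uncommitted changes; refusing to build wheels." >&2
  exit 1
fi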