Closed
72 commits
a61b7d5
Fix https://github.com/vllm-project/vllm/issues/17747 Bug
princepride May 7, 2025
b30790a
Update config.py
princepride May 7, 2025
8183e97
Update config.py
princepride May 7, 2025
c33870b
Update config.py
princepride May 7, 2025
88c4d8a
Update config.py
princepride May 7, 2025
074e27e
[doc] update the issue link (#17782)
reidliu41 May 7, 2025
f8890d3
[ROCm][FP8][Kernel] FP8 quantization fused into Custom Paged Attentio…
gshtras May 7, 2025
bbb980c
Only depend on importlib-metadata for Python < 3.10 (#17776)
tiran May 7, 2025
9992ed9
[Bugfix] Fix Video IO error for short video (#17791)
Isotr0py May 7, 2025
981767c
Fix and simplify `deprecated=True` CLI `kwarg` (#17781)
hmellor May 7, 2025
3f7dcec
[Bugfix] Fix missing lora name mapping for lora without prefix (#17793)
Isotr0py May 7, 2025
29aed29
[Quantization] Quark MXFP4 format loading (#16943)
BowenBao May 7, 2025
89c4c51
[Hardware][TPU][V1] Multi-LoRA implementation for the V1 TPU backend …
Akshat-Tripathi May 7, 2025
a49056b
[BugFix] Avoid secondary missing `MultiprocExecutor.workers` error (#…
njhill May 7, 2025
5ed956b
[Core][Feature] Input metadata dump on crash (#13407)
wallashss May 7, 2025
aef8e82
[Chore][Doc] uses model id determined from OpenAI client (#17815)
aarnphm May 8, 2025
023a56a
Don't call the venv `vllm` (#17810)
hmellor May 8, 2025
4cf0313
[BugFix] Fix `--disable-log-stats` in V1 server mode (#17600)
njhill May 8, 2025
8835526
[Core] Support full cuda graph in v1 (#16072)
chanh May 8, 2025
24aa65b
Improve exception reporting in MP engine (#17800)
vmarkovtsev May 8, 2025
2752cee
[Installation] OpenTelemetry version update (#17771)
Xarbirus May 8, 2025
0aa8870
Only log non-default CLI args for online serving (#17803)
hmellor May 8, 2025
98ea9e1
[V1] Add VLLM_ALLOW_INSECURE_SERIALIZATION env var (#17490)
russellb May 8, 2025
41f66c6
[Kernel][Hardware][AMD] Bf16 mfma opt for ROCm skinny GEMMs (#17071)
amd-hhashemi May 8, 2025
bfec858
[Hardware][Power] Enable compressed tensor W8A8 INT8 quantization for…
Akashcodes732 May 8, 2025
58899f2
[Hardware][Intel-Gaudi] Support Automatic Prefix Caching on HPU (#17648)
adobrzyn May 8, 2025
a69b514
[Frontend] Chat template fallbacks for multimodal models (#17805)
DarkLight1337 May 8, 2025
a3d6314
[Qwen3]add qwen3-235b-bf16 fused moe config on A100 (#17715)
Ximingwang-09 May 8, 2025
465d912
[Bugfix] Fix bad words for Mistral models (#17753)
qionghuang6 May 8, 2025
40e407b
[Misc] support model prefix & add deepseek vl2 tiny fused moe config …
xsank May 8, 2025
30eef1d
[Bugfix] Fix tool call template validation for Mistral models (#17644)
RIckYuan999 May 8, 2025
f50aeba
[TPU] Fix the test_sampler (#17820)
bythew3i May 8, 2025
ea259b0
[Bugfix] Fix quark fp8 format loading on AMD GPUs (#12612)
fxmarty-amd May 8, 2025
997c418
[Doc] Fix a typo in the file name (#17836)
DarkLight1337 May 8, 2025
2758496
[Easy] Eliminate c10::optional usage in vllm/csrc (#17819)
houseroad May 8, 2025
f241ac3
[Misc] add chatbox integration (#17828)
reidliu41 May 8, 2025
5c5f519
Fix transient dependency error in docs build (#17848)
hmellor May 8, 2025
b15dd35
[Bugfix] `use_fast` failing to be propagated to Qwen2-VL image proces…
DarkLight1337 May 8, 2025
6930402
[Misc] Delete LoRA-related redundancy code (#17841)
jeejeelee May 8, 2025
b872f27
[CI] Fix test_collective_rpc (#17858)
russellb May 8, 2025
b0cba5e
[V1] Improve VLLM_ALLOW_INSECURE_SERIALIZATION logging (#17860)
russellb May 8, 2025
73b45f3
[Test] Attempt all TPU V1 tests, even if some of them fail. (#17334)
yarongmu-google May 8, 2025
32308f8
[CI] Prune down lm-eval small tests (#17012)
mgoin May 8, 2025
d5f3453
Fix noisy warning for uncalibrated q_scale/p_scale (#17414)
mgoin May 8, 2025
765df60
Add cutlass support for blackwell fp8 blockwise gemm (#14383)
wenscarl May 8, 2025
eecedbd
[FEAT][ROCm]: Support AITER MLA on V1 Engine (#17523)
vllmellm May 9, 2025
c094ced
[V1][Structured Output] Update llguidance (`>= 0.7.11`) to avoid Attr…
shen-shanshan May 9, 2025
a16c7b7
[Attention] MLA move rotary embedding to cuda-graph region (#17668)
LucasWilkinson May 9, 2025
c465ae3
[BUGFIX]: return fast when request requires prompt logprobs (#17251)
andyxning May 9, 2025
c5911a8
[Docs] Add Slides from NYC Meetup (#17879)
simon-mo May 9, 2025
98785f6
[Doc] Update several links in reasoning_outputs.md (#17846)
windsonsea May 9, 2025
04b073c
[Doc] remove visible token in doc (#17884)
yma11 May 9, 2025
b15363a
[Bugfix][ROCm] Fix AITER MLA V1 (#17880)
vllmellm May 9, 2025
1557adc
[Bugfix][CPU] Fix broken AVX2 CPU TP support (#17252)
Isotr0py May 9, 2025
742e633
Fix Whisper crash caused by invalid `max_num_batched_tokens` conf…
inkcherry May 9, 2025
d689a9b
Change `top_k` to be disabled with `0` (still accept `-1` for now) (#…
hmellor May 9, 2025
d254a36
[Misc] add dify integration (#17895)
reidliu41 May 9, 2025
a4ba0e8
[BugFix][AMD] Compatible patch for latest AITER(05/07/2025) (#17864)
qli88 May 9, 2025
6e6b598
[v1] Move block management logic from KVCacheManager to SpecializedMa…
heheda12345 May 9, 2025
2f41e2d
[CI/Build] Automatically retry flaky tests (#17856)
DarkLight1337 May 9, 2025
095d2cb
Revert "[BugFix][AMD] Compatible patch for latest AITER(05/07/2025)" …
mgoin May 9, 2025
9fec6b0
[Misc] Add references in ray_serve_deepseek example (#17907)
ruisearch42 May 9, 2025
bd81998
[Misc] Auto fallback to float16 for pre-Ampere GPUs when detected bfl…
Isotr0py May 9, 2025
734bac4
Update CT WNA16MarlinMoE integration (#16666)
mgoin May 9, 2025
41033c7
Handle error when `str` passed to `/v1/audio/transcriptions` (#17909)
hmellor May 9, 2025
b70d966
Add option to use torch._inductor.standalone_compile (#17057)
zou3519 May 9, 2025
d2d6cb5
[V1][Spec Decoding] Include bonus tokens in mean acceptance length (#…
markmc May 9, 2025
7ee2365
Improve configs - the rest! (#17562)
hmellor May 9, 2025
2011736
AMD conditional all test execution // new test groups (#17556)
Alexei-V-Ivanov-AMD May 9, 2025
5eb9495
[Hardware/NVIDIA/Kernel] Enable nvidia/DeepSeek-R1-FP4 Model (#16362)
pavanimajety May 9, 2025
560c70e
[V1][Spec Decoding] Log accumulated metrics after system goes idle (#…
markmc May 10, 2025
1153393
adjust the code
princepride May 10, 2025
Original file line number Diff line number Diff line change
@@ -0,0 +1,11 @@
# bash .buildkite/lm-eval-harness/run-lm-eval-gsm-vllm-baseline.sh -m RedHatAI/Llama-3.2-1B-Instruct-FP8 -b "auto" -l 1319 -f 5 -t 1
model_name: "RedHatAI/Llama-3.2-1B-Instruct-FP8"
tasks:
- name: "gsm8k"
metrics:
- name: "exact_match,strict-match"
value: 0.335
- name: "exact_match,flexible-extract"
value: 0.323
limit: 1319
num_fewshot: 5
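
Configs like the one above record baseline accuracy values that the lm-eval harness run is compared against. A minimal sketch of that comparison logic (the tolerance, the `check_metrics` helper, and the shape of `measured` are illustrative assumptions, not the harness's actual code):

```python
# Sketch: compare measured lm-eval metrics against a baseline config dict.
# RTOL and the structure of `measured` are assumptions for illustration.
RTOL = 0.05

baseline = {
    "model_name": "RedHatAI/Llama-3.2-1B-Instruct-FP8",
    "tasks": [
        {
            "name": "gsm8k",
            "metrics": [
                {"name": "exact_match,strict-match", "value": 0.335},
                {"name": "exact_match,flexible-extract", "value": 0.323},
            ],
        }
    ],
}

def check_metrics(measured, config, rtol=RTOL):
    """Return (task, metric, expected, got, ok) for every configured metric."""
    results = []
    for task in config["tasks"]:
        for metric in task["metrics"]:
            got = measured[task["name"]][metric["name"]]
            ok = got >= metric["value"] - rtol  # pass if within tolerance
            results.append((task["name"], metric["name"],
                            metric["value"], got, ok))
    return results

measured = {"gsm8k": {"exact_match,strict-match": 0.34,
                      "exact_match,flexible-extract": 0.33}}
for row in check_metrics(measured, baseline):
    print(row)
```

Each YAML file in `.buildkite/lm-eval-harness/configs/` follows this same shape, so one checker can validate all of them.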
11 changes: 11 additions & 0 deletions .buildkite/lm-eval-harness/configs/Qwen2.5-1.5B-Instruct.yaml
@@ -0,0 +1,11 @@
# bash .buildkite/lm-eval-harness/run-lm-eval-gsm-vllm-baseline.sh -m Qwen/Qwen2.5-1.5B-Instruct -b auto -l 1319 -f 5 -t 1
model_name: "Qwen/Qwen2.5-1.5B-Instruct"
tasks:
- name: "gsm8k"
metrics:
- name: "exact_match,strict-match"
value: 0.54
- name: "exact_match,flexible-extract"
value: 0.59
limit: 1319
num_fewshot: 5
@@ -0,0 +1,11 @@
# bash .buildkite/lm-eval-harness/run-lm-eval-gsm-vllm-baseline.sh -m RedHatAI/Qwen2.5-VL-3B-Instruct-FP8-Dynamic -b auto -l 1319 -f 5 -t 1
model_name: "RedHatAI/Qwen2.5-VL-3B-Instruct-FP8-Dynamic"
tasks:
- name: "gsm8k"
metrics:
- name: "exact_match,strict-match"
value: 0.47
- name: "exact_match,flexible-extract"
value: 0.64
limit: 1319
num_fewshot: 5
1 change: 1 addition & 0 deletions .buildkite/lm-eval-harness/configs/models-large.txt
@@ -3,3 +3,4 @@ Meta-Llama-3-70B-Instruct.yaml
Mixtral-8x7B-Instruct-v0.1.yaml
Qwen2-57B-A14-Instruct.yaml
DeepSeek-V2-Lite-Chat.yaml
Meta-Llama-3-8B-QQQ.yaml
8 changes: 2 additions & 6 deletions .buildkite/lm-eval-harness/configs/models-small.txt
@@ -1,10 +1,6 @@
Meta-Llama-3-8B-Instruct.yaml
Meta-Llama-3-8B-Instruct-FP8-compressed-tensors.yaml
Qwen2.5-1.5B-Instruct.yaml
Meta-Llama-3.2-1B-Instruct-INT8-compressed-tensors.yaml
Meta-Llama-3-8B-Instruct-INT8-compressed-tensors-asym.yaml
Meta-Llama-3-8B-Instruct-nonuniform-compressed-tensors.yaml
Meta-Llama-3-8B-Instruct-Channelwise-compressed-tensors.yaml
Qwen2.5-VL-3B-Instruct-FP8-dynamic.yaml
Qwen1.5-MoE-W4A16-compressed-tensors.yaml
Qwen2-1.5B-Instruct-INT8-compressed-tensors.yaml
Qwen2-1.5B-Instruct-FP8W8.yaml
Meta-Llama-3-8B-QQQ.yaml
16 changes: 16 additions & 0 deletions .buildkite/scripts/hardware_ci/run-amd-test.sh
@@ -3,6 +3,9 @@
# This script runs test inside the corresponding ROCm docker container.
set -o pipefail

# Export Python path
export PYTHONPATH=".."

# Print ROCm version
echo "--- Confirming Clean Initial State"
while true; do
@@ -74,6 +77,15 @@ HF_MOUNT="/root/.cache/huggingface"

commands=$@
echo "Commands:$commands"

if [[ $commands == *"pytest -v -s basic_correctness/test_basic_correctness.py"* ]]; then
commands=${commands//"pytest -v -s basic_correctness/test_basic_correctness.py"/"VLLM_USE_TRITON_FLASH_ATTN=0 pytest -v -s basic_correctness/test_basic_correctness.py"}
fi

if [[ $commands == *"pytest -v -s compile/test_basic_correctness.py"* ]]; then
commands=${commands//"pytest -v -s compile/test_basic_correctness.py"/"VLLM_USE_TRITON_FLASH_ATTN=0 pytest -v -s compile/test_basic_correctness.py"}
fi

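
The two blocks above use bash's `${commands//pattern/replacement}` expansion to prepend `VLLM_USE_TRITON_FLASH_ATTN=0` to specific pytest invocations. The same idea, sketched in Python (the helper name is illustrative; the command strings come from the script):

```python
# Python analog of the bash substitution:
#   commands=${commands//"pytest ...py"/"VLLM_USE_TRITON_FLASH_ATTN=0 pytest ...py"}

def prefix_env(commands: str, target: str, env: str) -> str:
    """Prepend an env-var assignment to every occurrence of `target`."""
    if target in commands:
        commands = commands.replace(target, f"{env} {target}")
    return commands

cmds = "pytest -v -s basic_correctness/test_basic_correctness.py"
print(prefix_env(cmds, cmds, "VLLM_USE_TRITON_FLASH_ATTN=0"))
# → VLLM_USE_TRITON_FLASH_ATTN=0 pytest -v -s basic_correctness/test_basic_correctness.py
```

Like the bash version, this is a plain substring replacement, so the `target` must match the command text exactly, flags and all.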
#ignore certain kernels tests
if [[ $commands == *" kernels/core"* ]]; then
commands="${commands} \
@@ -161,6 +173,8 @@ fi


PARALLEL_JOB_COUNT=8
MYPYTHONPATH=".."

# check if the command contains shard flag, we will run all shards in parallel because the host have 8 GPUs.
if [[ $commands == *"--shard-id="* ]]; then
# assign job count as the number of shards used
@@ -181,6 +195,7 @@ if [[ $commands == *"--shard-id="* ]]; then
-e AWS_SECRET_ACCESS_KEY \
-v "${HF_CACHE}:${HF_MOUNT}" \
-e "HF_HOME=${HF_MOUNT}" \
-e "PYTHONPATH=${MYPYTHONPATH}" \
--name "${container_name}_${GPU}" \
"${image_name}" \
/bin/bash -c "${commands_gpu}" \
@@ -211,6 +226,7 @@ else
-e AWS_SECRET_ACCESS_KEY \
-v "${HF_CACHE}:${HF_MOUNT}" \
-e "HF_HOME=${HF_MOUNT}" \
-e "PYTHONPATH=${MYPYTHONPATH}" \
--name "${container_name}" \
"${image_name}" \
/bin/bash -c "${commands}"
103 changes: 75 additions & 28 deletions .buildkite/scripts/hardware_ci/run-tpu-v1-test.sh
@@ -1,6 +1,6 @@
#!/bin/bash

set -xue
set -xu

# Build the docker image.
docker build -f docker/Dockerfile.tpu -t vllm-tpu .
@@ -24,33 +24,80 @@ docker run --privileged --net host --shm-size=16G -it \
&& export VLLM_XLA_CHECK_RECOMPILATION=1 \
&& echo HARDWARE \
&& tpu-info \
&& echo TEST_0 \
&& pytest -v -s /workspace/vllm/tests/v1/tpu/test_perf.py \
&& echo TEST_1 \
&& pytest -v -s /workspace/vllm/tests/tpu/test_compilation.py \
&& echo TEST_2 \
&& pytest -v -s /workspace/vllm/tests/v1/tpu/test_basic.py \
&& echo TEST_3 \
&& pytest -v -s /workspace/vllm/tests/entrypoints/llm/test_accuracy.py::test_lm_eval_accuracy_v1_engine \
&& echo TEST_4 \
&& pytest -s -v /workspace/vllm/tests/tpu/test_quantization_accuracy.py \
&& echo TEST_5 \
&& python3 /workspace/vllm/examples/offline_inference/tpu.py \
&& echo TEST_6 \
&& pytest -s -v /workspace/vllm/tests/v1/tpu/worker/test_tpu_model_runner.py \
&& echo TEST_7 \
&& pytest -s -v /workspace/vllm/tests/v1/tpu/test_sampler.py \
&& echo TEST_8 \
&& pytest -s -v /workspace/vllm/tests/v1/tpu/test_topk_topp_sampler.py \
&& echo TEST_9 \
&& pytest -s -v /workspace/vllm/tests/v1/tpu/test_multimodal.py \
&& echo TEST_10 \
&& pytest -s -v /workspace/vllm/tests/v1/tpu/test_pallas.py \
&& echo TEST_11 \
&& pytest -s -v /workspace/vllm/tests/v1/entrypoints/llm/test_struct_output_generate.py \
&& echo TEST_12 \
&& pytest -s -v /workspace/vllm/tests/tpu/test_moe_pallas.py" \

{ \
echo TEST_0: Running test_perf.py; \
pytest -s -v /workspace/vllm/tests/tpu/test_perf.py; \
echo TEST_0_EXIT_CODE: \$?; \
} & \
{ \
echo TEST_1: Running test_compilation.py; \
pytest -s -v /workspace/vllm/tests/tpu/test_compilation.py; \
echo TEST_1_EXIT_CODE: \$?; \
} & \
{ \
echo TEST_2: Running test_basic.py; \
pytest -s -v /workspace/vllm/tests/v1/tpu/test_basic.py; \
echo TEST_2_EXIT_CODE: \$?; \
} & \
{ \
echo TEST_3: Running test_accuracy.py::test_lm_eval_accuracy_v1_engine; \
pytest -s -v /workspace/vllm/tests/entrypoints/llm/test_accuracy.py::test_lm_eval_accuracy_v1_engine; \
echo TEST_3_EXIT_CODE: \$?; \
} & \
{ \
echo TEST_4: Running test_quantization_accuracy.py; \
pytest -s -v /workspace/vllm/tests/tpu/test_quantization_accuracy.py; \
echo TEST_4_EXIT_CODE: \$?; \
} & \
{ \
echo TEST_5: Running examples/offline_inference/tpu.py; \
python3 /workspace/vllm/examples/offline_inference/tpu.py; \
echo TEST_5_EXIT_CODE: \$?; \
} & \
{ \
echo TEST_6: Running test_tpu_model_runner.py; \
pytest -s -v /workspace/vllm/tests/tpu/worker/test_tpu_model_runner.py; \
echo TEST_6_EXIT_CODE: \$?; \
} & \
{ \
echo TEST_7: Running test_sampler.py; \
pytest -s -v /workspace/vllm/tests/v1/tpu/test_sampler.py; \
echo TEST_7_EXIT_CODE: \$?; \
} & \
{ \
echo TEST_8: Running test_topk_topp_sampler.py; \
pytest -s -v /workspace/vllm/tests/v1/tpu/test_topk_topp_sampler.py; \
echo TEST_8_EXIT_CODE: \$?; \
} & \
{ \
echo TEST_9: Running test_multimodal.py; \
pytest -s -v /workspace/vllm/tests/v1/tpu/test_multimodal.py; \
echo TEST_9_EXIT_CODE: \$?; \
} & \
{ \
echo TEST_10: Running test_pallas.py; \
pytest -s -v /workspace/vllm/tests/v1/tpu/test_pallas.py; \
echo TEST_10_EXIT_CODE: \$?; \
} & \
{ \
echo TEST_11: Running test_struct_output_generate.py; \
pytest -s -v /workspace/vllm/tests/v1/entrypoints/llm/test_struct_output_generate.py; \
echo TEST_11_EXIT_CODE: \$?; \
} & \
{ \
echo TEST_12: Running test_moe_pallas.py; \
pytest -s -v /workspace/vllm/tests/tpu/test_moe_pallas.py; \
echo TEST_12_EXIT_CODE: \$?; \
} & \
# Disable the TPU LoRA tests until the feature is activated
# && { \
# echo TEST_13: Running TPU LoRA tests; \
# pytest -s -v /workspace/vllm/tests/tpu/lora/; \
# echo TEST_13_EXIT_CODE: \$?; \
# } & \
wait \
&& echo 'All tests have attempted to run. Check logs for individual test statuses and exit codes.' \
"

# TODO: This test fails because it uses RANDOM_SEED sampling
# && VLLM_USE_V1=1 pytest -v -s /workspace/vllm/tests/tpu/test_custom_dispatcher.py \
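
The rewritten script above launches each pytest group as a background job (`{ …; echo TEST_N_EXIT_CODE: $?; } &`) and then `wait`s, so every suite is attempted even when earlier ones fail. A minimal sketch of that pattern with `subprocess` (the labels and commands are placeholders, not the real test invocations):

```python
# Sketch of "attempt every test, record each exit code" using subprocess
# instead of bash background jobs.
import subprocess

def run_all(commands):
    """Launch all commands concurrently; return {label: exit_code}."""
    procs = {label: subprocess.Popen(cmd, shell=True)
             for label, cmd in commands.items()}
    # Equivalent of bash `wait`: block until every job finishes,
    # collecting exit codes rather than aborting on the first failure.
    return {label: proc.wait() for label, proc in procs.items()}

codes = run_all({
    "TEST_0": "true",   # placeholder for a passing test
    "TEST_1": "false",  # placeholder for a failing test
})
print(codes)
```

The trade-off is the same as in the script: failures no longer stop the run (hence dropping `set -e`), so the overall pass/fail decision has to be made from the collected exit codes afterward.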